With the rapid development of LLMs, LLM-as-a-Judge has garnered widespread attention in both academia and industry. LLM judges can serve not only as flexible evaluators across fields such as text generation, question answering, and dialogue systems, but also as drivers of model self-evolution and performance improvement. This repository aims to be a one-stop resource that helps developers, researchers, and practitioners explore how to leverage LLMs-as-Judges technology effectively.
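For readers new to the idea, here is a minimal sketch of pairwise LLM-as-a-Judge evaluation. It assumes the OpenAI Python SDK and an available chat model; the model name, criteria, and prompt wording are illustrative placeholders rather than the method of any specific paper in this list.

```python
# Minimal LLM-as-a-Judge sketch: pairwise comparison of two candidate answers.
# Assumes the OpenAI Python SDK (`pip install openai`) and an API key in OPENAI_API_KEY;
# model name, criteria, and prompt wording are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are an impartial judge. Compare the two answers to the question
on helpfulness, correctness, and clarity. Reply with exactly "A", "B", or "Tie".

Question: {question}

Answer A: {answer_a}

Answer B: {answer_b}
"""

def judge_pair(question: str, answer_a: str, answer_b: str, model: str = "gpt-4o") -> str:
    """Ask an LLM judge which of two answers is better; returns 'A', 'B', or 'Tie'."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # keep verdicts as deterministic as the API allows
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, answer_a=answer_a, answer_b=answer_b)}],
    )
    return response.choices[0].message.content.strip()

if __name__ == "__main__":
    print(judge_pair(
        "What causes tides?",
        "Mainly the Moon's gravity, with a smaller contribution from the Sun.",
        "Tides are caused by wind blowing across the ocean."))
```

In practice, judgments are often run twice with the answer order swapped and the two verdicts aggregated, a common mitigation for the position bias discussed in the Limitation section below.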
This repo includes the papers discussed in our latest survey paper:
📝LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods.
We will continuously track the latest developments in LLMs-as-Judges and regularly update the repository with the newest related papers. If you find this repository helpful, please give us a ⭐!
If you notice any work we've missed, please feel free to submit a pull request or contact us by email at liht22@mails.tsinghua.edu.cn.
We will keep both the repository and our paper up to date. Discussions and contributions are welcome!
Daily Papers on LLMs-as-Judges includes the latest paper titles and abstracts related to LLMs-as-Judges on arXiv, with information available in both English and Chinese.
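For reference, below is a minimal sketch of the kind of daily arXiv retrieval behind Daily Papers. It uses the public arXiv Atom API; the query string, fields, and output format are illustrative assumptions, not the exact pipeline used by this repository (the English/Chinese translation step is omitted).

```python
# Minimal sketch of daily arXiv retrieval for LLMs-as-Judges papers.
# Uses the public arXiv Atom API (http://export.arxiv.org/api/query); the query string
# and output fields are illustrative and not the exact Daily Papers pipeline.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom XML namespace used by the arXiv API

def fetch_recent_judge_papers(max_results: int = 20):
    """Return the most recently submitted arXiv papers matching LLM-as-a-Judge phrases."""
    query = 'all:"LLM-as-a-judge" OR all:"LLM as a judge" OR all:"LLMs as judges"'
    url = "http://export.arxiv.org/api/query?" + urllib.parse.urlencode({
        "search_query": query,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "max_results": max_results,
    })
    with urllib.request.urlopen(url) as resp:
        feed = ET.fromstring(resp.read())
    papers = []
    for entry in feed.findall(f"{ATOM}entry"):
        papers.append({
            "title": " ".join(entry.findtext(f"{ATOM}title", "").split()),
            "abstract": " ".join(entry.findtext(f"{ATOM}summary", "").split()),
            "url": entry.findtext(f"{ATOM}id", ""),
            "published": entry.findtext(f"{ATOM}published", ""),
        })
    return papers

if __name__ == "__main__":
    for paper in fetch_recent_judge_papers(5):
        print(paper["published"][:10], paper["title"], paper["url"])
```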
🔥🔥 News: 2024/12/20: We have updated Daily Papers on LLMs-as-Judges, which automatically retrieves and updates daily papers from arXiv related to LLMs-as-Judges.
🔥🔥 News: 2024/12/14: We compiled papers related to LLMs-as-Judges presented at NeurIPS 2024.
🔥🔥 News: 2024/12/10: We released the first version of the full paper.
🔥🔥 News: 2024/11/10: We completed the foundational work for the project and structured the framework.
- 🚀 Awesome-LLMs-as-Judges
- 🌟 About This Repo
- 📚 Daily arXiv Papers on LLMs-as-Judges
- ⚡️ Update
- 🌳 Contents
- 📖 Cite Our Work
- 📚 Overview of Awesome-LLMs-as-Judges
- 📑 PaperList
- 1. Functionality
- 2. Methodology
- 3. Application
- 4. Meta-evaluation
- 5. Limitation
- 👏 Welcome to Discussion
If you find our work useful, please don't hesitate to give us a ⭐ and cite our work:
```bibtex
@misc{li2024llmsasjudgescomprehensivesurveyllmbased,
  title={LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods},
  author={Haitao Li and Qian Dong and Junjie Chen and Huixue Su and Yujia Zhou and Qingyao Ai and Ziyi Ye and Yiqun Liu},
  year={2024},
  eprint={2412.05579},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2412.05579},
}
```
-
Llm-eval: Unified multi-dimensional automatic evaluation for open-domain conversations with large language models
ACL 2023. [Paper]
-
Automated Genre-Aware Article Scoring and Feedback Using Large Language Models
arXiv 2024. [Paper]
-
Is LLM a Reliable Reviewer? A Comprehensive Evaluation of LLM on Automatic Paper Reviewing Tasks
LREC-COLING 2024. [Paper]
-
Ares: An automated evaluation framework for retrieval-augmented generation systems
NAACL 2024. [Paper]
-
Self-rag: Learning to retrieve, generate, and critique through self-reflection
ICLR 2024. [Paper]
-
RecExplainer: Aligning Large Language Models for Explaining Recommendation Models
KDD 2024. [Paper]
-
Judging llm-as-a-judge with mt-bench and chatbot arena
NeurIPS 2023. [Paper]
-
Auto Arena of LLMs: Automating LLM Evaluations with Agent Peer-battles and Committee Discussions
arXiv 2024. [Paper]
-
VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation
arXiv 2024. [Paper]
-
Benchmarking foundation models with language-model-as-an-examiner
NeurIPS 2023. [Paper]
-
Kieval: A knowledge-grounded interactive evaluation framework for large language models
ACL 2024. [Paper]
-
Self-rewarding language models
ICML 2024. [Paper]
-
Direct language model alignment from online ai feedback
arXiv 2024. [Paper]
-
Rlaif: Scaling reinforcement learning from human feedback with ai feedback
arXiv 2024. [Paper]
-
Enhancing Reinforcement Learning with Dense Rewards from Language Model Critic
EMNLP 2024. [Paper]
-
Cream: Consistency regularized self-rewarding language models
arXiv 2024. [Paper]
-
The perfect blend: Redefining RLHF with mixture of judges
arXiv 2024. [Paper]
-
Adaptation with Self-Evaluation to Improve Selective Prediction in LLMs
EMNLP (findings) 2023. [Paper]
-
Regularized Best-of-N Sampling to Mitigate Reward Hacking for Language Model Alignment
arXiv 2024. [Paper]
-
Fast Best-of-N Decoding via Speculative Rejection
NeurIPS 2024. [Paper]
-
Tree of thoughts: Deliberate problem solving with large language models
NeurIPS 2024. [Paper]
-
Graph of thoughts: Solving elaborate problems with large language models
AAAI 2024. [Paper]
-
Let’s verify step by step
ICLR 2024. [Paper]
-
Self-evaluation guided beam search for reasoning
NeurIPS 2024. [Paper]
-
Rationale-Aware Answer Verification by Pairwise Self-Evaluation
arXiv 2024. [Paper]
-
Creative Beam Search: LLM-as-a-Judge for Improving Response Generation.
ICCC 2024. [Paper]
-
Self-refine: Iterative refinement with self-feedback
NeurIPS 2024. [Paper]
-
Teaching large language models to self-debug
arXiv 2023. [Paper]
-
Refiner: Reasoning feedback on intermediate representations
EACL 2024. [Paper]
-
Towards reasoning in large language models via multi-agent peer review collaboration
arXiv 2023. [Paper]
-
Large language models cannot self-correct reasoning yet
ICLR 2024. [Paper]
-
LLMs cannot find reasoning errors, but can correct them!
ACL (findings) 2024. [Paper]
-
Can large language models really improve by self-critiquing their own plans?
NeurIPS (Workshop) 2023. [Paper]
-
If in a Crowdsourced Data Annotation Pipeline, a GPT-4
CHI 2024. [Paper]
-
ChatGPT outperforms crowd workers for text-annotation tasks
PNAS 2023. [Paper]
-
ChatGPT-4 outperforms experts and crowd workers in annotating political Twitter messages with zero-shot learning
arXiv 2023. [Paper]
-
Fullanno: A data engine for enhancing image comprehension of MLLMs
arXiv 2024. [Paper]
-
Can large language models aid in annotating speech emotional data? Uncovering new frontiers
IEEE 2024. [Paper]
-
Annollm: Making large language models to be better crowdsourced annotators
NAACL 2024. [Paper]
-
LLMAAA: Making large language models as active annotators
EMNLP (findings) 2023. [Paper]
-
Selfee: Iterative self-revising LLM empowered by self-feedback generation
Blog post 2023. [Blog]
-
Self-Boosting Large Language Models with Synthetic Preference Data
arXiv 2024. [Paper]
-
The fellowship of the LLMs: Multi-agent workflows for synthetic preference optimization dataset generation
arXiv 2024. [Paper]
-
Self-consistency improves chain of thought reasoning in language models
ICLR 2023. [Paper]
-
WizardLM: Empowering large language models to follow complex instructions
ICLR 2024. [Paper]
-
Automatic Instruction Evolving for Large Language Models
EMNLP 2024. [Paper]
-
STaR: Self-taught reasoner bootstrapping reasoning with reasoning
NeurIPS 2022. [Paper]
-
Beyond human data: Scaling self-training for problem-solving with language models
arXiv 2023. [Paper]
-
SAFER-INSTRUCT: Aligning Language Models with Automated Preference Data
NAACL 2024. [Paper]
-
soda-eval: open-domain dialogue evaluation in the age of llms
EMNLP (findings) 2024. [Paper]
-
A systematic survey of prompt engineering in large language models: Techniques and applications
arXiv 2024. [Paper]
-
A survey on in-context learning
arXiv 2022. [Paper]
-
Gptscore: Evaluate as you desire
arXiv 2023. [Paper]
-
Llm-eval: Unified multi-dimensional automatic evaluation for open-domain conversations with large language models
NLP4ConvAI 2023. [Paper]
-
TALEC: Teach Your LLM to Evaluate in Specific Domain with In-house Criteria by Criteria Division and Zero-shot Plus Few-shot
arXiv 2024. [Paper]
-
Multi-dimensional evaluation of text summarization with in-context learning
ACL 2023 (Findings). [Paper]
-
Calibrate before use: Improving few-shot performance of language models
ICML 2021. [Paper]
-
Prototypical calibration for few-shot learning of language models
arXiv 2022. [Paper]
-
Mitigating label biases for in-context learning
ACL 2023. [Paper]
-
ALLURE: auditing and improving llm-based evaluation of text using iterative in-context-learning
arXiv 2023. [Paper]
-
Can Many-Shot In-Context Learning Help Long-Context LLM Judges? See More, Judge Better!
arXiv 2024. [Paper]
-
Chain-of-thought prompting elicits reasoning in large language models
NeurIPS 2022. [Paper]
-
Little giants: Exploring the potential of small llms as evaluation metrics in summarization in the eval4nlp 2023 shared task
Eval4NLP 2023. [Paper]
-
G-eval: Nlg evaluation using gpt-4 with better human alignment
arXiv 2023. [Paper]
-
ICE-Score: Instructing Large Language Models to Evaluate Code
EACL 2024 (findings). [Paper]
-
ProtocoLLM: Automatic Evaluation Framework of LLMs on Domain-Specific Scientific Protocol Formulation Tasks
arXiv 2024. [Paper]
-
A closer look into automatic evaluation using large language models
EMNLP 2023 (findings). [Paper]
-
FineSurE: Fine-grained summarization evaluation using LLMs
ACL 2024. [Paper]
-
Split and merge: Aligning position biases in large language model based evaluators
arXiv 2023. [Paper]
-
Can LLM be a Personalized Judge?
arXiv 2024. [Paper]
-
Biasalert: A plug-and-play tool for social bias detection in llms
arXiv 2024. [Paper]
-
LLMs are Biased Evaluators But Not Biased for Retrieval Augmented Generation
arXiv 2024. [Paper]
-
Unveiling Context-Aware Criteria in Self-Assessing LLMs
arXiv 2024. [Paper]
-
Calibrating llm-based evaluator
arXiv 2023. [Paper]
-
Large Language Models Are Active Critics in NLG Evaluation
arXiv 2024. [Paper]
-
Kieval: A knowledge-grounded interactive evaluation framework for large language models
ACL 2024. [Paper]
-
Auto Arena of LLMs: Automating LLM Evaluations with Agent Peer-battles and Committee Discussions
arXiv 2024. [Paper]
-
Benchmarking foundation models with language-model-as-an-examiner
NeurIPS 2023 (Datasets and Benchmarks). [Paper]
-
VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation
arXiv 2024 [Paper]
-
On the limitations of fine-tuned judge models for llm evaluation
arXiv 2024. [Paper]
-
Adaptation with self-evaluation to improve selective prediction in llms
EMNLP 2023 (Findings). [Paper]
-
Learning personalized story evaluation
arXiv 2023. [Paper]
-
Improving Model Factuality with Fine-grained Critique-based Evaluator
arXiv 2024. [Paper]
-
Ares: An automated evaluation framework for retrieval-augmented generation systems
NAACL 2024. [Paper]
-
PHUDGE: Phi-3 as Scalable Judge
arXiv 2024. [Paper]
-
Self-Judge: Selective Instruction Following with Alignment Self-Evaluation
arXiv 2024. [Paper]
-
Automatic evaluation of attribution by large language models
EMNLP 2023 (Findings). [Paper]
-
Sorry-bench: Systematically evaluating large language model safety refusal behaviors
arXiv 2024. [Paper]
-
Tigerscore: Towards building explainable metric for all text generation tasks
TMLR 2024. [Paper]
-
Beyond Scalar Reward Model: Learning Generative Judge from Preference Data
arXiv 2024. [Paper]
-
Prometheus: Inducing fine-grained evaluation capability in language models
ICLR 2024. [Paper]
-
Prometheus 2: An open source language model specialized in evaluating other language models
arXiv 2024. [Paper]
-
FedEval-LLM: Federated Evaluation of Large Language Models on Downstream Tasks with Collective Wisdom
arXiv 2024. [Paper]
-
Self-rationalization improves LLM as a fine-grained judge
arXiv 2024. [Paper]
-
Foundational autoraters: Taming large language models for better automatic evaluation
arXiv 2024. [Paper]
-
Self-taught evaluators
arXiv 2024. [Paper]
-
Pandalm: An automatic evaluation benchmark for llm instruction tuning optimization
ICLR 2024. [Paper]
-
CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution
arXiv 2024. [Paper]
-
Direct preference optimization: Your language model is secretly a reward model
NeurIPS 2023. [Paper]
-
Meta-rewarding language models: Self-improving alignment with llm-as-a-meta-judge
arXiv 2024. [Paper]
-
Judgelm: Fine-tuned large language models are scalable judges
arXiv 2023. [Paper]
-
INSTRUCTSCORE: Explainable Text Generation Evaluation with Finegrained Feedback
EMNLP 2023. [Paper]
-
Generative judge for evaluating alignment
arXiv 2023. [Paper]
-
Shepherd: A critic for language model generation
arXiv 2023. [Paper]
-
X-eval: Generalizable multi-aspect text evaluation via augmented instruction tuning with auxiliary evaluation aspects
NAACL 2024. [Paper]
-
Themis: A reference-free nlg evaluation language model with flexibility and interpretability
EMNLP 2024. [Paper]
-
Critiquellm: Towards an informative critique generation model for evaluation of large language model generation
ACL 2024. [Paper]
-
Mitigating the Bias of Large Language Model Evaluation
arXiv 2024. [Paper]
-
Halu-j: Critique-based hallucination judge
arXiv 2024. [Paper]
-
Prometheusvision: Vision-language model as a judge for fine-grained evaluation
ICLR 2024 (Workshop). [Paper]
-
Llava-critic: Learning to evaluate multimodal models
arXiv 2024. [Paper]
-
Efficient LLM Comparative Assessment: a Product of Experts Framework for Pairwise Comparisons
arXiv 2024. [Paper]
-
Aligning Model Evaluations with Human Preferences: Mitigating Token Count Bias in Language Model Assessments
arXiv 2024. [Paper]
-
Language Models can Evaluate Themselves via Probability Discrepancy
ACL 2024 (Findings). [Paper]
-
Mitigating biases for instruction-following language models via bias neurons elimination
ACL 2024. [Paper]
-
Evaluation metrics in the era of GPT-4: reliably evaluating large language models on sequence to sequence tasks
EMNLP 2023. [Paper]
-
Open-LLM-Leaderboard: From Multi-choice to Open-style Questions for LLMs Evaluation, Benchmark, and Arena
arXiv 2024. [Paper]
-
Consolidating Ranking and Relevance Predictions of Large Language Models through Post-Processing
arXiv 2024. [Paper]
-
RevisEval: Improving LLM-as-a-Judge via Response-Adapted References
arXiv 2024. [Paper]
-
Generative judge for evaluating alignment
arXiv 2023. [Paper]
-
Self-evaluation improves selective generation in large language models
NeurIPS 2023 (Workshops). [Paper]
-
AI can help humans find common ground in democratic deliberation
Science 2024. [Paper]
-
Towards reasoning in large language models via multi-agent peer review collaboration
arXiv 2023. [Paper]
-
Wider and deeper llm networks are fairer llm evaluators
arXiv 2023. [Paper]
-
ABSEval: An Agent-based Framework for Script Evaluation
ACL 2024. [Paper]
-
Adversarial Multi-Agent Evaluation of Large Language Models through Iterative Debates
arXiv 2024. [Paper]
-
A multi-llm debiasing framework
arXiv 2024. [Paper]
-
Prd: Peer rank and discussion improve large language model based evaluations
TMLR 2024. [Paper]
-
Chateval: Towards better llm-based evaluators through multi-agent debate
arXiv 2023. [Paper]
-
Evaluating the Performance of Large Language Models via Debates
arXiv 2024. [Paper]
-
Auto Arena of LLMs: Automating LLM Evaluations with Agent Peer-battles and Committee Discussions
arXiv 2024. [Paper]
-
Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models
arXiv 2024. [Paper]
-
Benchmarking foundation models with language-model-as-an-examiner
NeurIPS 2023 (Datasets and Benchmarks). [Paper]
-
Pre: A peer review based large language model evaluator
arXiv 2024. [Paper]
-
An Automatic and Cost-Efficient Peer-Review Framework for Language Generation Evaluation
arXiv 2024. [Paper]
-
Large language models as evaluators for recommendation explanations
RecSys 2024. [Paper]
-
Bayesian Calibration of Win Rate Estimation with LLM Evaluators
EMNLP 2024. [Paper]
-
Trust or Escalate: LLM Judges with Provable Guarantees for Human Agreement
arXiv 2024. [Paper]
-
Fusion-Eval: Integrating Assistant Evaluators with LLMs
EMNLP 2024 (Industry Track). [Paper]
-
AIME: AI System Optimization via Multiple LLM Evaluators
arXiv 2024. [Paper]
-
HD-Eval: Aligning Large Language Model Evaluators Through Hierarchical Criteria Decomposition
arXiv 2024. [Paper]
-
An empirical study of llm-as-a-judge for llm evaluation: Fine-tuned judge models are task-specific classifiers
arXiv 2024. [Paper]
-
Multi-News+: Cost-efficient Dataset Cleansing via LLM-based Data Annotation
EMNLP 2024. [Paper]
-
Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form Text
arXiv 2024. [Paper]
-
PiCO: Peer Review in LLMs based on the Consistency Optimization
arXiv 2024. [Paper]
-
Language Model Preference Evaluation with Multiple Weak Evaluators
arXiv 2024. [Paper]
-
Collaborative Evaluation: Exploring the Synergy of Large Language Models and Humans for Open-ended Generation Evaluation
arXiv 2023. [Paper]
-
Large language models are not fair evaluators
arXiv 2023. [Paper]
-
Large Language Model as an Assignment Evaluator: Insights, Feedback, and Challenges in a 1000+ Student Course
EMNLP 2024. [Paper]
-
Human-Centered Design Recommendations for LLM-as-a-judge
arXiv 2024. [Paper]
-
Who validates the validators? aligning llm-assisted evaluation of llm outputs with human preferences
UIST 2024. [Paper]
-
DailyDialog: A Manually Labelled Multi-turn Dialogue Dataset
IJCNLP 2017. [Poster]
-
Don’t give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization
EMNLP 2018. [Paper]
-
Improving LLM-based machine translation with systematic self-correction
arXiv 2024. [Paper]
-
Fusion-Eval: Integrating Assistant Evaluators with LLMs
EMNLP 2024. [Poster]
-
Llava-critic: Learning to evaluate multimodal models
arXiv 2024. [Paper]
-
Mllm-as-a-judge: Assessing multimodal llm-as-a-judge with vision-language benchmark
ICML 2024. [Paper]
-
Can large language models aid in annotating speech emotional data? uncovering new frontiers
IEEE 2024. [Paper]
-
Efficient Self-Improvement in Multimodal Large Language Models: A Model-Level Judge-Free Approach
arXiv 2024. [Paper]
-
Calibrated self-rewarding vision language models
NeurIPS 2024. [Paper]
-
Automated evaluation of large vision-language models on self-driving corner cases
arXiv 2024. [Paper]
-
DOCLENS: Multi-aspect fine-grained evaluation for medical text generation
ACL 2024. [Paper]
-
Comparing Two Model Designs for Clinical Note Generation; Is an LLM a Useful Evaluator of Consistency?
NAACL (findings) 2024. [Paper]
-
Towards Leveraging Large Language Models for Automated Medical Q&A Evaluation
arXiv 2024. [Paper]
-
Automatic evaluation for mental health counseling using LLMs
arXiv 2024. [Paper]
-
Improving medical reasoning through retrieval and self-reflection with retrieval-augmented large language models
Bioinformatics 2024. [Paper]
-
Disc-lawllm: Fine-tuning large language models for intelligent legal services
arXiv 2023. [Paper]
-
Retrieval-based Evaluation for LLMs: A Case Study in Korean Legal QA
NLLP (Workshop) 2023. [Paper]
-
Constructing domain-specific evaluation sets for llm-as-a-judge
customnlp4u (Workshop) 2024. [Paper]
-
Leveraging Large Language Models for Relevance Judgments in Legal Case Retrieval
arXiv 2024. [Paper]
-
Pixiu: A large language model, instruction data and evaluation benchmark for finance
NeurIPS 2023. [Paper]
-
GPT classifications, with application to credit lending
Machine Learning with Applications 2024. [Paper]
-
KRX Bench: Automating Financial Benchmark Creation via Large Language Models
FinNLP 2024. [Paper]
-
Large Language Model as an Assignment Evaluator: Insights, Feedback, and Challenges in a 1000+ Student Course
EMNLP 2024. [Paper]
-
Automated Genre-Aware Article Scoring and Feedback Using Large Language Models
arXiv 2024. [Paper]
-
Automated Essay Scoring and Revising Based on Open-Source Large Language Models
IEEE 2024. [Paper]
-
Is LLM a Reliable Reviewer? A Comprehensive Evaluation of LLM on Automatic Paper Reviewing Tasks.
LREC-COLING 2024. [Paper]
-
Evaluating Mathematical Reasoning Beyond Accuracy
COLM 2024. [Paper]
-
Debatrix: Multi-dimensional Debate Judge with Iterative Chronological Analysis Based on LLM
ACL (findings) 2024. [Paper]
-
LLMJudge: LLMs for Relevance Judgments
LLM4Eval 2024. [Paper]
-
Don’t Use LLMs to Make Relevance Judgments
arXiv 2024. [Paper]
-
JudgeRank: Leveraging Large Language Models for Reasoning-Intensive Reranking
arXiv 2024. [Paper]
-
Large language models as evaluators for recommendation explanations
RecSys 2024. [Paper]
-
Ares: An automated evaluation framework for retrieval-augmented generation systems
NAACL 2024. [Paper]
-
AIME: AI System Optimization via Multiple LLM Evaluators
arXiv 2024. [Paper]
-
CodeUltraFeedback: An LLM-as-a-Judge Dataset for Aligning Large Language Models to Coding Preferences
arXiv 2024. [Paper]
-
LLMs as Evaluators: A Novel Approach to Evaluate Bug Report Summarization
arXiv 2024. [Paper]
-
Using Large Language Models to Evaluate Biomedical Query-Focused Summarisation
Biomedical NLP (Workshop) 2024. [Paper]
-
AI can help humans find common ground in democratic deliberation
Science 2024. [Paper]
-
Sotopia: Interactive evaluation for social intelligence in language agents
ICLR (spotlight) 2024. [Paper]
-
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
ICLR 2024 [Paper]
-
CodeUltraFeedback: An LLM-as-a-Judge Dataset for Aligning Large Language Models to Coding Preferences
arXiv 2024 [Paper]
-
Evaluating Large Language Models Trained on Code
arXiv 2021 [Paper]
-
Agent-as-a-Judge: Evaluate Agents with Agents
arXiv 2024 [Paper]
-
CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion
NeurIPS 2023 (Datasets and Benchmarks Track) [Paper]
-
Experts, errors, and context: A large-scale study of human evaluation for machine translation
TACL 2021 [Paper]
-
Results of the WMT21 Metrics Shared Task: Evaluating Metrics with Expert-Based Human Evaluations on TED and News Domain
WMT 2021 [Paper]
-
Large Language Models Effectively Leverage Document-Level Context for Literary Translation, but Critical Errors Persist
WMT 2023 [Paper]
-
Understanding Factuality in Abstractive Summarization with FRANK: A Benchmark for Factuality Metrics
NAACL 2021 [Paper]
-
SummEval: Re-evaluating Summarization Evaluation
TACL 2021 [Paper]
-
Opinsummeval: Revisiting automated evaluation for opinion summarization
EMNLP 2023 [Paper]
-
Topical-Chat: Towards Knowledge-Grounded Open-Domain Conversations
INTERSPEECH 2019 [Paper]
-
Automatic evaluation and moderation of open-domain dialogue systems
DSTC10 [Paper]
-
Personalizing Dialogue Agents: I Have a Dog, Do You Have Pets Too?
ACL 2018 [Paper]
-
USR: An Unsupervised and Reference-Free Evaluation Metric for Dialog Generation
ACL 2020 [Paper]
-
Overview of the Tenth Dialog System Technology Challenge: DSTC10
IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023 [Paper]
-
OpenMEVA: A Benchmark for Evaluating Open-ended Story Generation Metrics
ACL 2021 [Paper]
-
Of Human Criteria and Automatic Metrics: A Benchmark of the Evaluation of Story Generation
COLING 2022 [Paper]
-
Learning Personalized Story Evaluation
ICLR 2024 [Paper]
-
Hierarchical Neural Story Generation
ACL 2018 [Paper]
-
A Corpus and Cloze Evaluation for Deeper Understanding of Commonsense Stories
NAACL 2016 [Paper]
-
StoryER: Automatic Story Evaluation via Ranking, Rating and Reasoning
EMNLP 2022 [Paper]
-
A general language assistant as a laboratory for alignment
arXiv 2021 [Paper]
-
PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference
arXiv 2024 [Paper]
-
Cvalues: Measuring the values of Chinese large language models from safety to responsibility
arXiv 2023 [Paper]
-
Large Language Models as Evaluators for Recommendation Explanations
RecSys 2024 [Paper]
-
Yelp Dataset Challenge: Review Rating Prediction
arXiv 2016 [Paper]
-
The movielens datasets: History and context
TiiS 2016 [Paper]
-
Lecardv2: A large-scale chinese legal case retrieval dataset
SIGIR 2024 [Paper]
-
Overview of the TREC 2021 Deep Learning Track
TREC 2021 [Paper]
-
Overview of the TREC 2023 NeuCLIR Track
TREC 2023 [Paper]
-
Ms MARCO: A human generated machine reading comprehension dataset
ICLR 2017 [Paper]
-
Overview of the TREC 2022 Deep Learning Track
TREC 2022 [Paper]
-
Length-controlled alpacaeval: A simple way to debias automatic evaluators
COLM 2024 [Paper]
-
Helpsteer: Multi-attribute helpfulness dataset for steerlm
NAACL 2024 [Paper]
-
ULTRAFEEDBACK: Boosting Language Models with Scaled AI Feedback
ICML 2024 [Paper]
-
Helpsteer2-preference: Complementing ratings with preferences
CoRR 2024 [Paper]
-
Enhancing Chat Language Models by Scaling High-quality Instructional Conversations
EMNLP 2023 [Paper]
-
RewardBench: Evaluating Reward Models for Language Modeling
arXiv preprint, March 2024 [Paper]
-
FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets
ICLR 2024 [Paper]
-
RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style
ICLR 2025 [Paper]
-
MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark
ICML 2024 [Paper]
-
MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models
arXiv 2024 [Paper]
-
TruthfulQA: Measuring How Models Mimic Human Falsehoods
arXiv 2021 [Paper]
-
CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution
arXiv 2024 [Paper]
-
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
NeurIPS 2023 [Paper]
-
WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild
arXiv 2024 [Paper]
-
JudgeBench: A Benchmark for Evaluating LLM-based Judges
arXiv 2024 [Paper]
-
Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality
LMSYS Org 2023 [Paper]
-
Pearson correlation coefficient
Philosophical Transactions of the Royal Society of London, 1895 [Paper]
-
Spearman’s rank correlation coefficient
The American Journal of Psychology, 1904 [Paper]
-
Estimates of the regression coefficient based on Kendall's tau
Journal of the American Statistical Association, 1968 [Paper]
-
The Intraclass Correlation Coefficient as a Measure of Reliability
Psychological Reports, 1966 [Paper]
-
Five ways to look at Cohen's kappa
Journal of Psychology & Psychotherapy, 2015 [Paper]
-
Large language models are not robust multiple choice selectors
ICLR 2024 [Paper]
-
Look at the first sentence: Position bias in question answering
EMNLP 2020 [Paper]
-
Batch calibration: Rethinking calibration for in-context learning and prompt engineering
ICLR 2024 [Paper]
-
Large Language Models Are Zero-Shot Rankers for Recommender Systems
ECIR 2024 [Paper]
-
Position bias in multiple-choice questions
Journal of Marketing Research, 1984 [Paper]
-
JurEE not Judges: safeguarding llm interactions with small, specialised Encoder Ensembles
arXiv preprint, October 2024 [Paper]
-
Split and merge: Aligning position biases in large language model based evaluators
EMNLP 2024 [Paper]
-
Judging the Judges: A Systematic Investigation of Position Bias in Pairwise Comparative Assessments by LLMs
arXiv 2024 [Paper]
-
Large Language Models are not Fair Evaluators
ACL 2024 [Paper]
-
Reducing Selection Bias in Large Language Models
arXiv 2024 [Paper]
-
CalibraEval: Calibrating Prediction Distribution to Mitigate Selection Bias in LLMs-as-Judges
arXiv 2024 [Paper]
-
Large language models are not fair evaluators
ACL 2024 [Paper]
-
Debating with more persuasive LLMs leads to more truthful answers
ICML 2024 [Paper]
-
Position bias estimation for unbiased learning to rank in personal search
WSDM 2018 [Paper]
-
Humans or LLMs as the judge? A study on judgement biases
EMNLP 2024 [Paper]
-
Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment
EMNLP 2024 [Paper]
-
Judging the Judges: A Systematic Investigation of Position Bias in Pairwise Comparative Assessments
arXiv preprint, June 2024 [Paper]
-
Generative judge for evaluating alignment
ICLR 2024 [Paper]
-
Justice or prejudice? quantifying biases in llm-as-a-judge
SafeGenAI @ NeurIPS 2024 [Paper]
-
Benchmarking Cognitive Biases in Large Language Models as Evaluators
Findings of ACL 2024 [Paper]
-
Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge
SFLLM Workshop @ NeurIPS 2024 [Paper]
-
Humans or LLMs as the Judge? A Study on Judgement Bias
EMNLP 2024 [Paper]
-
Mind vs. Mouth: On Measuring Re-judge Inconsistency of Social Bias in Large Language Models
arXiv 2024 [Paper]
-
Calibrate Before Use: Improving Few-Shot Performance of Language Models
ICML 2021 [Paper]
-
Mitigating Label Biases for In-Context Learning
ACL 2023 [Paper]
-
Prototypical Calibration for Few-Shot Learning of Language Models
ICLR 2023 [Paper]
-
Bias Patterns in the Application of LLMs for Clinical Decision Support: A Comprehensive Study
arXiv preprint 2024 [Paper]
-
Batch Calibration: Rethinking Calibration for In-Context Learning and Prompt Engineering
ICLR 2024 [Paper]
-
Large Language Models Can Be Easily Distracted by Irrelevant Context
ICML 2023 [Paper]
-
Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form Text
ACL ARR 2024 [Paper]
-
Trust or Escalate: LLM Judges with Provable Guarantees for Human Agreement
arXiv preprint, July 2024 [Paper]
-
Pride and Prejudice: LLM Amplifies Self-Bias in Self-Refinement
ACL 2024 [Paper]
-
Humans or LLMs as the Judge? A Study on Judgement Biases
arXiv preprint 2024 [Paper]
-
Evaluations of Self and Others: Self-Enhancement Biases in Social Judgments
Social Cognition 1986 [Paper]
-
Benchmarking Cognitive Biases in Large Language Models as Evaluators
Findings of ACL 2024 [Paper]
-
Debating with More Persuasive LLMs Leads to More Truthful Answers
ICML 2024 [Paper]
-
HotFlip: White-Box Adversarial Examples for Text Classification
ACL 2018 [Paper]
-
Query-Efficient and Scalable Black-Box Adversarial Attacks on Discrete Sequential Data via Bayesian Optimization
ICML 2022 [Paper]
-
Adv-BERT: BERT is Not Robust on Misspellings! Generating Natural Adversarial Samples on BERT
arXiv 2020 [Paper]
-
An LLM Can Fool Itself: A Prompt-Based Adversarial Attack
ICLR 2024 [Paper]
-
Natural Backdoor Attack on Text Data
arXiv 2020 [Paper]
-
Ignore Previous Prompt: Attack Techniques for Language Models
arXiv 2022 [Paper]
-
Prompt Packer: Deceiving LLMs through Compositional Instruction with Hidden Attacks
arXiv 2023 [Paper]
-
Evaluating the Susceptibility of Pre-Trained Language Models via Handcrafted Adversarial Examples
arXiv 2022 [Paper]
-
Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates
SafeGenAI @ NeurIPS 2024 [Paper]
-
Optimization-based Prompt Injection Attack to LLM-as-a-Judge
arXiv 2024 [Paper]
-
Finding Blind Spots in Evaluator LLMs with Interpretable Checklists
EMNLP 2024 [Paper]
-
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
NeurIPS 2023 Datasets and Benchmarks Track [Paper]
-
Scaling Instruction-Finetuned Language Models
Journal of Machine Learning Research 2024 [Paper]
-
Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment
EMNLP 2024 [Paper]
-
Retrieval-Augmented Generation for Large Language Models: A Survey
arXiv 2023 [Paper]
-
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
NeurIPS 2020 [Paper]
-
An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-Tuning
arXiv 2023 [Paper]
-
Continual Learning for Large Language Models: A Survey
arXiv 2024 [Paper]
-
Striking the Balance in Using LLMs for Fact-Checking: A Narrative Literature Review
MISDOOM 2024 [Paper]
-
Survey of Hallucination in Natural Language Generation
ACM COMPUTING SURVEYS 2024 [Paper]
-
A Comprehensive Survey of Hallucination Mitigation Techniques in Large Language Models
arXiv 2024 [Paper]
-
Knowledge Solver: Teaching LLMs to Search for Domain Knowledge from Knowledge Graphs
arXiv 2023 [Paper]
-
Unifying Large Language Models and Knowledge Graphs: A Roadmap
arXiv 2023 [Paper]
-
Retrieval-Augmented Generation for Large Language Models: A Survey
arXiv 2023 [Paper]
-
Limitations of the LLM-as-a-Judge Approach for Evaluating LLM Outputs in Expert Knowledge Tasks
arXiv 2024 [Paper]
-
Limits to scalable evaluation at the frontier: LLM as Judge won’t beat twice the data
arXiv 2024 [Paper]
-
Measuring the Inconsistency of Large Language Models in Ordinal Preference Formation
KnowLLM (Workshop) 2024 [Paper]
We welcome anyone interested in this topic to engage in friendly discussion with us!