A curated reading list for large language model (LLM) alignment. Take a look at our new survey "Large Language Model Alignment: A Survey" on arXiv for more details!
Feel free to open an issue/PR or e-mail thshen@tju.edu.cn and dyxiong@tju.edu.cn if you find any missing areas, papers, or datasets. We will keep updating this list and survey.
If you find our survey useful, please cite our paper:
```bibtex
@article{shen2023alignment,
  title={Large Language Model Alignment: A Survey},
  author={Shen, Tianhao and Jin, Renren and Huang, Yufei and Liu, Chuang and Dong, Weilong and Guo, Zishan and Wu, Xinwei and Liu, Yan and Xiong, Deyi},
  journal={arXiv preprint arXiv:2309.15025},
  year={2023}
}
```
- Aligning Large Language Models with Human: A Survey. Yufei Wang et al. arXiv 2023. [Paper]
- Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment. Yang Liu et al. arXiv 2023. [Paper]
- Bridging the Gap: A Survey on Integrating (Human) Feedback for Natural Language Generation. Patrick Fernandes et al. arXiv 2023. [Paper]
- Augmented Language Models: a Survey. Grégoire Mialon et al. arXiv 2023. [Paper]
- An Overview of Catastrophic AI Risks. Dan Hendrycks et al. arXiv 2023. [Paper]
- A Survey of Large Language Models. Wayne Xin Zhao et al. arXiv 2023. [Paper]
- A Survey on Universal Adversarial Attack. Chaoning Zhang et al. IJCAI 2021. [Paper]
- Survey of Hallucination in Natural Language Generation. Ziwei Ji et al. ACM Computing Surveys 2022. [Paper]
- Automatically Correcting Large Language Models: Surveying the landscape of diverse self-correction strategies. Liangming Pan et al. arXiv 2023. [Paper]
- Automatic Detection of Machine Generated Text: A Critical Survey. Ganesh Jawahar et al. COLING 2020. [Paper]
- Synchromesh: Reliable Code Generation from Pre-trained Language Models. Gabriel Poesia et al. ICLR 2022. [Paper]
- LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models. Chan Hee Song et al. ICCV 2023. [Paper]
- Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents. Wenlong Huang et al. ICML 2022. [Paper]
- Tool Learning with Foundation Models. Yujia Qin et al. arXiv 2023. [Paper]
- Ethical and social risks of harm from Language Models. Laura Weidinger et al. arXiv 2021. [Paper]
- Predictive Biases in Natural Language Processing Models: A Conceptual Framework and Overview. Deven Shah et al. arXiv 2019. [Paper]
- RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. Samuel Gehman et al. arXiv 2020. [Paper]
- Extracting Training Data from Large Language Models. Nicholas Carlini et al. arXiv 2020. [Paper]
- StereoSet: Measuring stereotypical bias in pretrained language models. Moin Nadeem et al. arXiv 2020. [Paper]
- CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models. Nikita Nangia et al. EMNLP 2020. [Paper]
- HONEST: Measuring Hurtful Sentence Completion in Language Models. Debora Nozza et al. NAACL 2021. [Paper]
- Language Models are Few-Shot Learners. Tom Brown et al. NeurIPS 2020. [Paper]
- Persistent Anti-Muslim Bias in Large Language Models. Abubakar Abid et al. AIES 2021. [Paper]
- Gender and Representation Bias in GPT-3 Generated Stories. Li Lucy et al. WNU 2021. [Paper]
- Measuring and Improving Consistency in Pretrained Language Models. Yanai Elazar et al. TACL 2021. [Paper]
- GPT-3 Creative Fiction. Gwern Branwen. 2020. [Blog]
- GPT-3: What’s It Good for? Robert Dale. Natural Language Engineering 2020. [Paper]
- Scaling Language Models: Methods, Analysis & Insights from Training Gopher. Jack W. Rae et al. arXiv 2021. [Paper]
- TruthfulQA: Measuring How Models Mimic Human Falsehoods. Stephanie Lin et al. ACL 2022. [Paper]
- Towards Tracing Knowledge in Language Models Back to the Training Data. Ekin Akyurek et al. EMNLP 2022 Findings. [Paper]
- Sparks of Artificial General Intelligence: Early experiments with GPT-4. Sébastien Bubeck et al. arXiv 2023. [Paper]
- Navigating the Grey Area: Expressions of Overconfidence and Uncertainty in Language Models. Kaitlyn Zhou et al. arXiv 2023. [Paper]
- Patient and Consumer Safety Risks When Using Conversational Assistants for Medical Information: An Observational Study of Siri, Alexa, and Google Assistant. Timothy W. Bickmore et al. Journal of Medical Internet Research 2018. [Paper]
- Will ChatGPT Replace Lawyers? Kate Rattray. 2023. [Blog]
- Constitutional AI: Harmlessness from AI Feedback. Yuntao Bai et al. arXiv 2022. [Paper]
- Truth, Lies, and Automation: How Language Models Could Change Disinformation. Ben Buchanan et al. Center for Security and Emerging Technology 2021. [Paper]
- Understanding the Capabilities, Limitations, and Societal Impact of Large Language Models. Alex Tamkin et al. arXiv 2021. [Paper]
- Deal or No Deal? End-to-End Learning for Negotiation Dialogues. Mike Lewis et al. arXiv 2017. [Paper]
- Evaluating Large Language Models Trained on Code. Mark Chen et al. arXiv 2021. [Paper]
- Artificial intelligence and biological misuse: Differentiating risks of language models and biological design tools. Jonas B. Sandbrink. arXiv 2023. [Paper]
- Sustainable AI: AI for sustainability and the sustainability of AI. Aimee van Wynsberghe. AI and Ethics 2021. [Paper]
- Unraveling the Hidden Environmental Impacts of AI Solutions for Environment. Anne-Laure Ligozat et al. arXiv 2021. [Paper]
- GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models. Tyna Eloundou et al. arXiv 2023. [Paper]
- Formalizing Convergent Instrumental Goals. Tsvi Benson-Tilsen et al. AAAI AIES Workshop 2016. [Paper]
- Model evaluation for extreme risks. Toby Shevlane et al. arXiv 2023. [Paper]
- Aligning AI Optimization to Community Well-Being. Jonathan Stray. International Journal of Community Well-Being 2020. [Paper]
- What are you optimizing for? Aligning Recommender Systems with Human Values. Jonathan Stray et al. ICML 2020. [Paper]
- Human-level play in the game of Diplomacy by combining language models with strategic reasoning. Meta Fundamental AI Research Diplomacy Team (FAIR) et al. Science 2022. [Paper]
- Characterizing Manipulation from AI Systems. Micah Carroll et al. arXiv 2023. [Paper]
- Deceptive Alignment Monitoring. Andres Carranza et al. ICML AdvML Workshop 2023. [Paper]
- The Superintelligent Will: Motivation and Instrumental Rationality in Advanced Artificial Agents. Nick Bostrom. Minds and Machines 2012. [Paper]
- Is Power-Seeking AI an Existential Risk? Joseph Carlsmith. arXiv 2022. [Paper]
- Optimal Policies Tend To Seek Power. Alexander Matt Turner et al. NeurIPS 2021. [Paper]
- Parametrically Retargetable Decision-Makers Tend To Seek Power. Alexander Matt Turner et al. NeurIPS 2022. [Paper]
- Power-seeking can be probable and predictive for trained agents. Victoria Krakovna et al. arXiv 2023. [Paper]
- Discovering Language Model Behaviors with Model-Written Evaluations. Ethan Perez et al. arXiv 2022. [Paper]
- Some Moral and Technical Consequences of Automation: As Machines Learn They May Develop Unforeseen Strategies at Rates That Baffle Their Programmers. Norbert Wiener. Science 1960. [Paper]
- Coherent Extrapolated Volition. Eliezer Yudkowsky. Singularity Institute for Artificial Intelligence 2004. [Paper]
- The Basic AI Drives. Stephen M. Omohundro. AGI 2008. [Paper]
- The Superintelligent Will: Motivation and Instrumental Rationality in Advanced Artificial Agents. Nick Bostrom. Minds and Machines 2012. [Paper]
- General Purpose Intelligence: Arguing the Orthogonality Thesis. Stuart Armstrong. Analysis and Metaphysics 2013. [Paper]
- Aligning Superintelligence with Human Interests: An Annotated Bibliography. Nate Soares. Intelligence 2015. [Paper]
- Concrete Problems in AI Safety. Dario Amodei et al. arXiv 2016. [Paper]
- The Mythos of Model Interpretability. Zachary C. Lipton. arXiv 2017. [Paper]
- AI Safety Gridworlds. Jan Leike et al. arXiv 2017. [Paper]
- Overview of Current AI Alignment Approaches. Micah Carroll. 2018. [Paper]
- Risks from Learned Optimization in Advanced Machine Learning Systems. Evan Hubinger et al. arXiv 2019. [Paper]
- An Overview of 11 Proposals for Building Safe Advanced AI. Evan Hubinger. arXiv 2020. [Paper]
- Unsolved Problems in ML Safety. Dan Hendrycks et al. arXiv 2021. [Paper]
- A Mathematical Framework for Transformer Circuits. Nelson Elhage et al. Transformer Circuits Thread 2021. [Paper]
- Alignment of Language Agents. Zachary Kenton et al. arXiv 2021. [Paper]
- A General Language Assistant as a Laboratory for Alignment. Amanda Askell et al. arXiv 2021. [Paper]
- A Transparency and Interpretability Tech Tree. Evan Hubinger. 2022. [Blog]
- Understanding AI Alignment Research: A Systematic Analysis. J. Kirchner et al. arXiv 2022. [Paper]
- Softmax Linear Units. Nelson Elhage et al. Transformer Circuits Thread 2022. [Paper]
- The Alignment Problem from a Deep Learning Perspective. Richard Ngo. arXiv 2022. [Paper]
- Paradigms of AI Alignment: Components and Enablers. Victoria Krakovna. 2022. [Blog]
- Progress Measures for Grokking via Mechanistic Interpretability. Neel Nanda et al. arXiv 2023. [Paper]
- Agentized LLMs Will Change the Alignment Landscape. Seth Herd. 2023. [Blog]
- Language Models Can Explain Neurons in Language Models. Steven Bills et al. 2023. [Paper]
- Core Views on AI Safety: When, Why, What, and How. Anthropic. 2023. [Blog]
- Proximal Policy Optimization Algorithms. John Schulman et al. arXiv 2017. [Paper]
- Fine-Tuning Language Models from Human Preferences. Daniel M Ziegler et al. arXiv 2019. [Paper]
- Learning to Summarize with Human Feedback. Nisan Stiennon et al. NeurIPS 2020. [Paper]
- Training Language Models to Follow Instructions with Human Feedback. Long Ouyang et al. NeurIPS 2022. [Paper]
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. Yuntao Bai et al. arXiv 2022. [Paper]
- RL4F: Generating Natural Language Feedback with Reinforcement Learning for Repairing Model Outputs. Afra Feyza Akyürek et al. arXiv 2023. [Paper]
- Improving Language Models with Advantage-Based Offline Policy Gradients. Ashutosh Baheti et al. arXiv 2023. [Paper]
- Scaling Laws for Reward Model Overoptimization. Leo Gao et al. ICML 2023. [Paper]
- Improving Alignment of Dialogue Agents via Targeted Human Judgements. Amelia Glaese et al. arXiv 2022. [Paper]
- Aligning Language Models with Preferences through F-Divergence Minimization. Dongyoung Go et al. arXiv 2023. [Paper]
- Aligning Large Language Models through Synthetic Feedback. Sungdong Kim et al. arXiv 2023. [Paper]
- RLHF. Ansh Radhakrishnan. Lesswrong 2022. [Blog]
- Guiding Large Language Models via Directional Stimulus Prompting. Zekun Li et al. arXiv 2023. [Paper]
- Aligning Generative Language Models with Human Values. Ruibo Liu et al. NAACL 2022 Findings. [Paper]
- Second Thoughts Are Best: Learning to Re-Align with Human Values from Text Edits. Ruibo Liu et al. NeurIPS 2022. [Paper]
- Secrets of RLHF in Large Language Models Part I: PPO. Rui Zheng et al. arXiv 2023. [Paper]
- Principled Reinforcement Learning with Human Feedback from Pairwise or K-Wise Comparisons. Banghua Zhu et al. arXiv 2023. [Paper]
- Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback. Stephen Casper et al. arXiv 2023. [Paper]
- Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP. Timo Schick et al. TACL 2021. [Paper]
- The Cringe Loss: Learning What Language Not to Model. Leonard Adolphs et al. arXiv 2022. [Paper]
- Leashing the Inner Demons: Self-detoxification for Language Models. Canwen Xu et al. AAAI 2022. [Paper]
- Calibrating Sequence Likelihood Improves Conditional Language Generation. Yao Zhao et al. arXiv 2022. [Paper]
- RAFT: Reward Ranked Finetuning for Generative Foundation Model Alignment. Hanze Dong et al. arXiv 2023. [Paper]
- Chain of Hindsight Aligns Language Models with Feedback. Hao Liu et al. arXiv 2023. [Paper]
- Training Socially Aligned Language Models in Simulated Human Society. Ruibo Liu et al. arXiv 2023. [Paper]
- Direct Preference Optimization: Your Language Model Is Secretly a Reward Model. Rafael Rafailov et al. arXiv 2023. [Paper]
- Training Language Models with Language Feedback at Scale. Jérémy Scheurer et al. arXiv 2023. [Paper]
- Preference Ranking Optimization for Human Alignment. Feifan Song et al. arXiv 2023. [Paper]
- RRHF: Rank Responses to Align Language Models with Human Feedback without Tears. Zheng Yuan et al. arXiv 2023. [Paper]
- SLiC-HF: Sequence Likelihood Calibration with Human Feedback. Yao Zhao et al. arXiv 2023. [Paper]
- LIMA: Less Is More for Alignment. Chunting Zhou et al. arXiv 2023. [Paper]
- Supervising Strong Learners by Amplifying Weak Experts. Paul Christiano et al. arXiv 2018. [Paper]
- Scalable Agent Alignment via Reward Modeling: A Research Direction. Jan Leike et al. arXiv 2018. [Paper]
- AI Safety Needs Social Scientists. Geoffrey Irving and Amanda Askell. Distill 2019. [Paper]
- Learning to Summarize with Human Feedback. Nisan Stiennon et al. NeurIPS 2020. [Paper]
- Task Decomposition for Scalable Oversight (AGISF Distillation). Charbel-Raphaël Segerie. 2023. [Blog]
- Measuring Progress on Scalable Oversight for Large Language Models. Samuel R Bowman et al. arXiv 2022. [Paper]
- Constitutional AI: Harmlessness from AI Feedback. Yuntao Bai et al. arXiv 2022. [Paper]
- Improving Factuality and Reasoning in Language Models through Multiagent Debate. Yilun Du et al. arXiv 2023. [Paper]
- Evaluating Superhuman Models with Consistency Checks. Lukas Fluri et al. arXiv 2023. [Paper]
- AI Safety via Debate. Geoffrey Irving et al. arXiv 2018. [Paper]
- AI Safety via Market Making. Evan Hubinger. 2020. [Blog]
- Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate. Tian Liang et al. arXiv 2023. [Paper]
- Let's Verify Step by Step. Hunter Lightman et al. arXiv 2023. [Paper]
- Introducing Superalignment. OpenAI. 2023. [Blog]
- Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision. Zhiqing Sun et al. arXiv 2023. [Paper]
- Risks from Learned Optimization in Advanced Machine Learning Systems. Evan Hubinger et al. arXiv 2019. [Paper]
- Goal Misgeneralization in Deep Reinforcement Learning. Lauro Langosco et al. ICML 2022. [Paper]
- Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals. Rohin Shah et al. arXiv 2022. [Paper]
- Defining capability and alignment in gradient descent. Edouard Harris. Lesswrong 2020. [Blog]
- Categorizing failures as “outer” or “inner” misalignment is often confused. Rohin Shah. Lesswrong 2023. [Blog]
- Inner Alignment Failures" Which Are Actually Outer Alignment Failures. John Wentworth. Lesswrong 2020. [Blog]
- Relaxed adversarial training for inner alignment. Evan Hubinger. Lesswrong 2019. [Blog]
- The Inner Alignment Problem. Evan Hubinger et al. Lesswrong 2019. [Blog]
- Three scenarios of pseudo-alignment. Eleni Angelou. Lesswrong 2022. [Blog]
- Deceptive Alignment. Evan Hubinger et al. Lesswrong 2019. [Blog]
- What failure looks like. Paul Christiano. AI Alignment Forum 2019. [Blog]
- Concrete experiments in inner alignment. Evan Hubinger. Lesswrong 2019. [Blog]
- A central AI alignment problem: capabilities generalization, and the sharp left turn. Nate Soares. Lesswrong 2022. [Blog]
- Clarifying the confusion around inner alignment. Rauno Arike. AI Alignment Forum 2022. [Blog]
- 2-D Robustness. Vladimir Mikulik. AI Alignment Forum 2019. [Blog]
- Monitoring for deceptive alignment. Evan Hubinger. Lesswrong 2022. [Blog]
- Notions of explainability and evaluation approaches for explainable artificial intelligence. Giulia Vilone et al. arXiv 2020. [Paper]
- A Comprehensive Mechanistic Interpretability Explainer & Glossary. Neel Nanda. 2022. [Blog]
- The Mythos of Model Interpretability. Zachary C. Lipton. arXiv 2017. [Paper]
- AI research considerations for human existential safety (ARCHES). Andrew Critch et al. arXiv 2020. [Paper]
- Concrete problems for autonomous vehicle safety: Advantages of Bayesian deep learning. Rowan McAllister et al. IJCAI 2017. [Paper]
- In-context Learning and Induction Heads. Catherine Olsson et al. Transformer Circuits Thread 2022. [Paper]
- Transformer Feed-Forward Layers Are Key-Value Memories. Mor Geva et al. EMNLP 2021. [Paper]
- Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space. Mor Geva et al. EMNLP 2022. [Paper]
- Softmax Linear Units. Nelson Elhage et al. Transformer Circuits Thread 2022. [Paper]
- Toy Models of Superposition. Nelson Elhage et al. Transformer Circuits Thread 2022. [Paper]
- Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases. Chris Olah. 2022. [Paper]
- Knowledge Neurons in Pretrained Transformers. Damai Dai et al. ACL 2022. [Paper]
- Locating and editing factual associations in GPT. Kevin Meng et al. NeurIPS 2022. [Paper]
- Inference-Time Intervention: Eliciting Truthful Answers from a Language Model. Kenneth Li et al. arXiv 2023. [Paper]
- LEACE: Perfect linear concept erasure in closed form. Nora Belrose et al. arXiv 2023. [Paper]
- Jailbreaker: Automated Jailbreak Across Multiple Large Language Model Chatbots. Gelei Deng et al. arXiv 2023. [Paper]
- Multi-step Jailbreaking Privacy Attacks on ChatGPT. Haoran Li et al. arXiv 2023. [Paper]
- Prompt Injection Attack Against LLM-integrated Applications. Yi Liu et al. arXiv 2023. [Paper]
- Prompt as Triggers for Backdoor Attack: Examining the Vulnerability in Language Models. Shuai Zhao et al. arXiv 2023. [Paper]
- More Than You've Asked for: A Comprehensive Analysis of Novel Prompt Injection Threats to Application-Integrated Large Language Models. Kai Greshake et al. arXiv 2023. [Paper]
- Backdoor Attacks for In-Context Learning with Language Models. Nikhil Kandpal et al. arXiv 2023. [Paper]
- BadGPT: Exploring Security Vulnerabilities of ChatGPT via Backdoor Attacks to InstructGPT. Jiawen Shi et al. arXiv 2023. [Paper]
- Universal and Transferable Adversarial Attacks on Aligned Language Models. Andy Zou et al. arXiv 2023. [Paper]
- Are Aligned Neural Networks Adversarially Aligned?. Nicholas Carlini et al. arXiv 2023. [Paper]
- Visual Adversarial Examples Jailbreak Large Language Models. Xiangyu Qi et al. arXiv 2023. [Paper]
- FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. Sewon Min et al. arXiv 2023. [Paper]
- Factuality Enhanced Language Models for Open-ended Text Generation. Nayeon Lee et al. NeurIPS 2022. [Paper]
- TruthfulQA: Measuring How Models Mimic Human Falsehoods. Stephanie Lin et al. arXiv 2021. [Paper]
- SummaC: Re-visiting NLI-based Models for Inconsistency Detection in Summarization. Philippe Laban et al. TACL 2022. [Paper]
- QAFactEval: Improved QA-based Factual Consistency Evaluation for Summarization. Alexander R. Fabbri et al. arXiv 2021. [Paper]
- TRUE: Re-evaluating Factual Consistency Evaluation. Or Honovich et al. arXiv 2022. [Paper]
- AlignScore: Evaluating Factual Consistency with a Unified Alignment Function. Yuheng Zha et al. arXiv 2023. [Paper]
- Social Chemistry 101: Learning to Reason about Social and Moral Norms. Maxwell Forbes et al. arXiv 2020. [Paper]
- Aligning AI with Shared Human Values. Dan Hendrycks et al. arXiv 2020. [Paper]
- Would You Rather? A New Benchmark for Learning Machine Alignment with Cultural Values and Social Preferences. Yi Tay et al. ACL 2020. [Paper]
- Scruples: A Corpus of Community Ethical Judgments on 32,000 Real-life Anecdotes. Nicholas Lourie et al. AAAI 2021. [Paper]
- Detecting Offensive Language in Social Media to Protect Adolescent Online Safety. Ying Chen et al. PASSAT-SocialCom 2012. [Paper]
- Offensive Language Detection Using Multi-level Classification. Amir H. Razavi et al. Canadian AI 2010. [Paper]
- Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter. Zeerak Waseem and Dirk Hovy. NAACL Student Research Workshop 2016. [Paper]
- Measuring the Reliability of Hate Speech Annotations: The Case of the European Refugee Crisis. Bjorn Ross et al. NLP4CMC 2016. [Paper]
- Ex Machina: Personal Attacks Seen at Scale. Ellery Wulczyn et al. WWW 2017. [Paper]
- Predicting the Type and Target of Offensive Posts in Social Media. Marcos Zampieri et al. NAACL-HLT 2019. [Paper]
- Recipes for Safety in Open-Domain Chatbots. Jing Xu et al. arXiv 2020. [Paper]
- RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. Samuel Gehman et al. EMNLP 2020 Findings. [Paper]
- COLD: A Benchmark for Chinese Offensive Language Detection. Jiawen Deng et al. EMNLP 2022. [Paper]
- Gender Bias in Coreference Resolution. Rachel Rudinger et al. NAACL 2018. [Paper]
- Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods. Jieyu Zhao et al. NAACL 2018. [Paper]
- The Winograd Schema Challenge. Hector Levesque et al. KR 2012. [Paper]
- Toward Gender-Inclusive Coreference Resolution: An Analysis of Gender and Bias Throughout the Machine Learning Lifecycle. Yang Trista Cao and Hal Daumé III. Computational Linguistics 2021. [Paper]
- Evaluating Gender Bias in Machine Translation. Gabriel Stanovsky et al. ACL 2019. [Paper]
- Investigating Failures of Automatic Translation in the Case of Unambiguous Gender. Adithya Renduchintala and Adina Williams. ACL 2022. [Paper]
- Towards Understanding Gender Bias in Relation Extraction. Andrew Gaut et al. ACL 2020. [Paper]
- Addressing Age-Related Bias in Sentiment Analysis. Mark Díaz et al. CHI 2018. [Paper]
- Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems. Svetlana Kiritchenko and Saif M. Mohammad. NAACL-HLT 2018. [Paper]
- On Measuring and Mitigating Biased Inferences of Word Embeddings. Sunipa Dev et al. AAAI 2020. [Paper]
- Social Bias Frames: Reasoning About Social and Power Implications of Language. Maarten Sap et al. ACL 2020. [Paper]
- Towards Identifying Social Bias in Dialog Systems: Framework, Dataset, and Benchmark. Jingyan Zhou et al. EMNLP 2022 Findings. [Paper]
- CORGI-PM: A Chinese Corpus for Gender Bias Probing and Mitigation. Ge Zhang et al. arXiv 2023. [Paper]
- StereoSet: Measuring Stereotypical Bias in Pretrained Language Models. Moin Nadeem et al. ACL 2021. [Paper]
- CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models. Nikita Nangia et al. EMNLP 2020. [Paper]
- BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation. Jwala Dhamala et al. FAccT 2021. [Paper]
- “I’m sorry to hear that”: Finding New Biases in Language Models with a Holistic Descriptor Dataset. Eric Michael Smith et al. EMNLP 2022. [Paper]
- Multilingual Holistic Bias: Extending Descriptors and Patterns to Unveil Demographic Biases in Languages at Scale. Marta R. Costa-jussà et al. arXiv 2023. [Paper]
- UNQOVERing Stereotyping Biases via Underspecified Questions. Tao Li et al. EMNLP 2020 Findings. [Paper]
- BBQ: A Hand-Built Bias Benchmark for Question Answering. Alicia Parrish et al. ACL 2022 Findings. [Paper]
- CBBQ: A Chinese Bias Benchmark Dataset Curated with Human-AI Collaboration for Large Language Models. Yufei Huang and Deyi Xiong. arXiv 2023. [Paper]
- Automated Hate Speech Detection and the Problem of Offensive Language. Thomas Davidson et al. ICWSM 2017. [Paper]
- Deep Learning for Hate Speech Detection in Tweets. Pinkesh Badjatiya et al. WWW 2017. [Paper]
- Detecting Hate Speech on the World Wide Web. William Warner and Julia Hirschberg. NAACL-HLT 2012. [Paper]
- A Survey on Hate Speech Detection using Natural Language Processing. Anna Schmidt and Michael Wiegand. SocialNLP 2017. [Paper]
- Hate Speech Detection with Comment Embeddings. Nemanja Djuric et al. WWW 2015. [Paper]
- Are You a Racist or Am I Seeing Things? Annotator Influence on Hate Speech Detection on Twitter. Zeerak Waseem. NLP+CSS@EMNLP 2016. [Paper]
- TweetBLM: A Hate Speech Dataset and Analysis of Black Lives Matter-related Microblogs on Twitter. Sumit Kumar and Raj Ratn Pranesh. arXiv 2021. [Paper]
- Hate Speech Dataset from a White Supremacy Forum. Ona de Gibert et al. ALW2 2018. [Paper]
- The Gab Hate Corpus: A Collection of 27k Posts Annotated for Hate Speech. Brendan Kennedy et al. LRE 2022. [Paper]
- Finding Microaggressions in the Wild: A Case for Locating Elusive Phenomena in Social Media Posts. Luke Breitfeller et al. EMNLP 2019. [Paper]
- Learning from the Worst: Dynamically Generated Datasets to Improve Online Hate Detection. Bertie Vidgen et al. ACL 2021. [Paper]
- Hate speech detection: Challenges and solutions. Sean MacAvaney et al. PloS One 2019. [Paper]
- Racial Microaggressions in Everyday Life: Implications for Clinical Practice. Derald Wing Sue et al. American Psychologist 2007. [Paper]
- The Impact of Racial Microaggressions on Mental Health: Counseling Implications for Clients of Color. Kevin L. Nadal et al. Journal of Counseling & Development 2014. [Paper]
- A Preliminary Report on the Relationship Between Microaggressions Against Black People and Racism Among White College Students. Jonathan W. Kanter et al. Race and Social Problems 2017. [Paper]
- Microaggressions and Traumatic Stress: Theory, Research, and Clinical Treatment. Kevin L. Nadal. American Psychological Association 2018. [Paper]
- Arabs as Terrorists: Effects of Stereotypes Within Violent Contexts on Attitudes, Perceptions, and Affect. Muniba Saleem and Craig A. Anderson. Psychology of Violence 2013. [Paper]
- Mean Girls? The Influence of Gender Portrayals in Teen Movies on Emerging Adults' Gender-Based Attitudes and Beliefs. Elizabeth Behm-Morawitz and Dana E. Mastro. Journalism and Mass Communication Quarterly 2008. [Paper]
- Exposure to Hate Speech Increases Prejudice Through Desensitization. Wiktor Soral, Michał Bilewicz, and Mikołaj Winiewski. Aggressive Behavior 2018. [Paper]
- Latent Hatred: A Benchmark for Understanding Implicit Hate Speech. Mai ElSherief et al. EMNLP 2021. [Paper]
- ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection. Thomas Hartvigsen et al. ACL 2022. [Paper]
- An Empirical Study of Metrics to Measure Representational Harms in Pre-Trained Language Models. Saghar Hosseini, Hamid Palangi, and Ahmed Hassan Awadallah. arXiv 2023. [Paper]
- TrustGPT: A Benchmark for Trustworthy and Responsible Large Language Models. Yue Huang et al. arXiv 2023. [Paper]
- Safety Assessment of Chinese Large Language Models. Hao Sun et al. arXiv 2023. [Paper]
- FLASK: Fine-grained Language Model Evaluation Based on Alignment Skill Sets. Seonghyeon Ye et al. arXiv 2023. [Paper]
- Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Lianmin Zheng et al. arXiv 2023. [Paper]
- Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models. Aarohi Srivastava et al. arXiv 2023. [Paper]
- A Critical Evaluation of Evaluations for Long-form Question Answering. Fangyuan Xu et al. arXiv 2023. [Paper]
- AlpacaEval: An Automatic Evaluator of Instruction-following Models. Xuechen Li et al. Github 2023. [Github]
- AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback. Yann Dubois et al. arXiv 2023. [Paper]
- PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization. Yidong Wang et al. arXiv 2023. [Paper]
- Large Language Models are not Fair Evaluators. Peiyi Wang et al. arXiv 2023. [Paper]
- G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. Yang Liu et al. arXiv 2023. [Paper]
- Benchmarking Foundation Models with Language-Model-as-an-Examiner. Yushi Bai et al. arXiv 2023. [Paper]
- PRD: Peer Rank and Discussion Improve Large Language Model based Evaluations. Ruosen Li et al. arXiv 2023. [Paper]
- SELF-INSTRUCT: Aligning Language Models with Self-Generated Instructions. Yizhong Wang et al. arXiv 2023. [Paper]