LLM Alignment

Survey

  • A Comprehensive Survey of LLM Alignment Techniques: RLHF, RLAIF, PPO, DPO and More, arXiv, 2407.16216, arxiv, pdf, cication: -1

    Zhichao Wang, Bin Bi, Shiva Kumar Pentyala, Kiran Ramnath, Sougata Chaudhuri, Shubham Mehrotra, Zixu Zhu, Xiang-Bo Mao, Sitaram Asur

  • Towards Scalable Automated Alignment of LLMs: A Survey, arXiv, 2406.01252, arxiv, pdf, cication: -1

    Boxi Cao, Keming Lu, Xinyu Lu, Jiawei Chen, Mengjie Ren, Hao Xiang, Peilin Liu, Yaojie Lu, Ben He, Xianpei Han

  • Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study, arXiv, 2404.10719, arxiv, pdf, cication: -1

    Shusheng Xu, Wei Fu, Jiaxuan Gao, Wenjie Ye, Weilin Liu, Zhiyu Mei, Guangju Wang, Chao Yu, Yi Wu

  • On the Essence and Prospect: An Investigation of Alignment Approaches for Big Models, arXiv, 2403.04204, arxiv, pdf, cication: -1

    Xinpeng Wang, Shitong Duan, Xiaoyuan Yi, Jing Yao, Shanlin Zhou, Zhihua Wei, Peng Zhang, Dongkuan Xu, Maosong Sun, Xing Xie

  • AI Alignment: A Comprehensive Survey, arXiv, 2310.19852, arxiv, pdf, cication: 1

    Jiaming Ji, Tianyi Qiu, Boyuan Chen, Borong Zhang, Hantao Lou, Kaile Wang, Yawen Duan, Zhonghao He, Jiayi Zhou, Zhaowei Zhang

  • Instruction Tuning for Large Language Models: A Survey, arXiv, 2308.10792, arxiv, pdf, cication: 19

    Shengyu Zhang, Linfeng Dong, Xiaoya Li, Sen Zhang, Xiaofei Sun, Shuhe Wang, Jiwei Li, Runyi Hu, Tianwei Zhang, Fei Wu · (mp.weixin.qq)

  • Large Language Model Alignment: A Survey, arXiv, 2309.15025, arxiv, pdf, cication: 3

    Tianhao Shen, Renren Jin, Yufei Huang, Chuang Liu, Weilong Dong, Zishan Guo, Xinwei Wu, Yan Liu, Deyi Xiong · (jiqizhixin) · (llm-alignment-survey - Magnetic2014) Star

  • Aligning Large Language Models with Human: A Survey, arXiv, 2307.12966, arxiv, pdf, cication: 29

    Yufei Wang, Wanjun Zhong, Liangyou Li, Fei Mi, Xingshan Zeng, Wenyong Huang, Lifeng Shang, Xin Jiang, Qun Liu · (AlignLLMHumanSurvey - GaryYufei) Star

Papers & Projects

  • Course-Correction: Safety Alignment Using Synthetic Preferences, arXiv, 2407.16637, arxiv, pdf, cication: -1

    Rongwu Xu, Yishuo Cai, Zhenhong Zhou, Renjie Gu, Haiqin Weng, Yan Liu, Tianwei Zhang, Wei Xu, Han Qiu

  • Better Alignment with Instruction Back-and-Forth Translation, arXiv, 2408.04614, arxiv, pdf, cication: -1

    Thao Nguyen, Jeffrey Li, Sewoong Oh, Ludwig Schmidt, Jason Weston, Luke Zettlemoyer, Xian Li

  • Phi-3 Safety Post-Training: Aligning Language Models with a "Break-Fix" Cycle, arXiv, 2407.13833, arxiv, pdf, cication: -1

    Emman Haider, Daniel Perez-Becker, Thomas Portet, Piyush Madan, Amit Garg, David Majercak, Wen Wen, Dongwoo Kim, Ziyi Yang, Jianwen Zhang

  • Direct Preference Knowledge Distillation for Large Language Models, arXiv, 2406.19774, arxiv, pdf, cication: -1

    Yixing Li, Yuxian Gu, Li Dong, Dequan Wang, Yu Cheng, Furu Wei

  • On scalable oversight with weak LLMs judging strong LLMs, arXiv, 2407.04622, arxiv, pdf, cication: -1

    Zachary Kenton, Noah Y. Siegel, János Kramár, Jonah Brown-Cohen, Samuel Albanie, Jannis Bulian, Rishabh Agarwal, David Lindner, Yunhao Tang, Noah D. Goodman

  • Creativity Has Left the Chat: The Price of Debiasing Language Models, arXiv, 2406.05587, arxiv, pdf, cication: -1

    Behnam Mohammadi

  • Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms, arXiv, 2406.02900, arxiv, pdf, cication: -1

    Rafael Rafailov, Yaswanth Chittepu, Ryan Park, Harshit Sikchi, Joey Hejna, Bradley Knox, Chelsea Finn, Scott Niekum

  • Self-Improving Robust Preference Optimization, arXiv, 2406.01660, arxiv, pdf, cication: -1

    Eugene Choi, Arash Ahmadian, Matthieu Geist, Olivier Pietquin, Mohammad Gheshlaghi Azar

  • Show, Don't Tell: Aligning Language Models with Demonstrated Feedback, arXiv, 2406.00888, arxiv, pdf, cication: -1

    Omar Shaikh, Michelle Lam, Joey Hejna, Yijia Shao, Michael Bernstein, Diyi Yang

    · (demonstrated-feedback - SALT-NLP) Star

  • Xwin-LM: Strong and Scalable Alignment Practice for LLMs, arXiv, 2405.20335, arxiv, pdf, cication: -1

    Bolin Ni, JingCheng Hu, Yixuan Wei, Houwen Peng, Zheng Zhang, Gaofeng Meng, Han Hu · (Xwin-LM - Xwin-LM) Star

  • Offline Regularised Reinforcement Learning for Large Language Models Alignment, arXiv, 2405.19107, arxiv, pdf, cication: -1

    Pierre Harvey Richemond, Yunhao Tang, Daniel Guo, Daniele Calandriello, Mohammad Gheshlaghi Azar, Rafael Rafailov, Bernardo Avila Pires, Eugene Tarassov, Lucas Spangher, Will Ellsworth

  • FLAME: Factuality-Aware Alignment for Large Language Models, arXiv, 2405.01525, arxiv, pdf, cication: -1

    Sheng-Chieh Lin, Luyu Gao, Barlas Oguz, Wenhan Xiong, Jimmy Lin, Wen-tau Yih, Xilun Chen

  • NeMo-Aligner: Scalable Toolkit for Efficient Model Alignment, arXiv, 2405.01481, arxiv, pdf, cication: -1

    Gerald Shen, Zhilin Wang, Olivier Delalleau, Jiaqi Zeng, Yi Dong, Daniel Egert, Shengyang Sun, Jimmy Zhang, Sahil Jain, Ali Taghibakhshi · (NeMo-Aligner - NVIDIA) Star

  • Simple probes can catch sleeper agents · Anthropic

  • The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions, arXiv, 2404.13208, arxiv, pdf, cication: -1

    Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, Alex Beutel

  • OpenBezoar: Small, Cost-Effective and Open Models Trained on Mixes of Instruction Data, arXiv, 2404.12195, arxiv, pdf, cication: -1

    Chandeepa Dissanayake, Lahiru Lowe, Sachith Gunasekara, Yasiru Ratnayake · (huggingface)

  • Learn Your Reference Model for Real Good Alignment, arXiv, 2404.09656, arxiv, pdf, cication: -1

    Alexey Gorbatovski, Boris Shaposhnikov, Alexey Malakhov, Nikita Surnachev, Yaroslav Aksenov, Ian Maksimov, Nikita Balagansky, Daniil Gavrilov

  • Foundational Challenges in Assuring Alignment and Safety of Large Language Models, arXiv, 2404.09932, arxiv, pdf, cication: -1

    Usman Anwar, Abulhair Saparov, Javier Rando, Daniel Paleka, Miles Turpin, Peter Hase, Ekdeep Singh Lubana, Erik Jenner, Stephen Casper, Oliver Sourbut

  • Verifiable by Design: Aligning Language Models to Quote from Pre-Training Data, arXiv, 2404.03862, arxiv, pdf, cication: -1

    Jingyu Zhang, Marc Marone, Tianjian Li, Benjamin Van Durme, Daniel Khashabi

  • CantTalkAboutThis: Aligning Language Models to Stay on Topic in Dialogues, arXiv, 2404.03820, arxiv, pdf, cication: -1

    Makesh Narsimhan Sreedhar, Traian Rebedea, Shaona Ghosh, Christopher Parisien

  • Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences, arXiv, 2404.03715, arxiv, pdf, cication: -1

    Corby Rosset, Ching-An Cheng, Arindam Mitra, Michael Santacroce, Ahmed Awadallah, Tengyang Xie

  • Alignment Studio: Aligning Large Language Models to Particular Contextual Regulations, arXiv, 2403.09704, arxiv, pdf, cication: -1

    Swapnaja Achintalwar, Ioana Baldini, Djallel Bouneffouf, Joan Byamugisha, Maria Chang, Pierre Dognin, Eitan Farchi, Ndivhuwo Makondo, Aleksandra Mojsilovic, Manish Nagireddy

  • Instruction-tuned Language Models are Better Knowledge Learners, arXiv, 2402.12847, arxiv, pdf, cication: -1

    Zhengbao Jiang, Zhiqing Sun, Weijia Shi, Pedro Rodriguez, Chunting Zhou, Graham Neubig, Xi Victoria Lin, Wen-tau Yih, Srinivasan Iyer

  • Reformatted Alignment, arXiv, 2402.12219, arxiv, pdf, cication: -1

    Run-Ze Fan, Xuefeng Li, Haoyang Zou, Junlong Li, Shwai He, Ethan Chern, Jiewen Hu, Pengfei Liu · (ReAlign - GAIR-NLP) Star · (gair-nlp.github)

    · (qbitai)

  • Aligner: Achieving Efficient Alignment through Weak-to-Strong Correction, arXiv, 2402.02416, arxiv, pdf, cication: -1

    Jiaming Ji, Boyuan Chen, Hantao Lou, Donghai Hong, Borong Zhang, Xuehai Pan, Juntao Dai, Yaodong Yang · (jiqizhixin) · (aligner2024.github)

  • LESS: Selecting Influential Data for Targeted Instruction Tuning, arXiv, 2402.04333, arxiv, pdf, cication: -1

    Mengzhou Xia, Sadhika Malladi, Suchin Gururangan, Sanjeev Arora, Danqi Chen · (less - princeton-nlp) Star

    · (qbitai)

  • Generative Representational Instruction Tuning, arXiv, 2402.09906, arxiv, pdf, cication: -1

    Niklas Muennighoff, Hongjin Su, Liang Wang, Nan Yang, Furu Wei, Tao Yu, Amanpreet Singh, Douwe Kiela

  • DeAL: Decoding-time Alignment for Large Language Models, arXiv, 2402.06147, arxiv, pdf, cication: -1

    James Y. Huang, Sailik Sengupta, Daniele Bonadiman, Yi-an Lai, Arshit Gupta, Nikolaos Pappas, Saab Mansour, Katrin Kirchhoff, Dan Roth

  • Direct Language Model Alignment from Online AI Feedback, arXiv, 2402.04792, arxiv, pdf, cication: -1

    Shangmin Guo, Biao Zhang, Tianlin Liu, Tianqi Liu, Misha Khalman, Felipe Llinares, Alexandre Rame, Thomas Mesnard, Yao Zhao, Bilal Piot

  • Specialized Language Models with Cheap Inference from Limited Domain Data, arXiv, 2402.01093, arxiv, pdf, cication: -1

    David Grangier, Angelos Katharopoulos, Pierre Ablin, Awni Hannun

  • Human-Instruction-Free LLM Self-Alignment with Limited Samples, arXiv, 2401.06785, arxiv, pdf, cication: -1

    Hongyi Guo, Yuanshun Yao, Wei Shen, Jiaheng Wei, Xiaoying Zhang, Zhaoran Wang, Yang Liu

  • WaveCoder: Widespread And Versatile Enhanced Instruction Tuning with Refined Data Generation, arXiv, 2312.14187, arxiv, pdf, cication: -1

    Zhaojian Yu, Xin Zhang, Ning Shang, Yangyu Huang, Can Xu, Yishujie Zhao, Wenxiang Hu, Qiufeng Yin

  • Teach Llamas to Talk: Recent Progress in Instruction Tuning

    · (mp.weixin.qq)

  • weak-to-strong - openai Star

    · (openai) · (cdn.openai) · (jiqizhixin) · (mp.weixin.qq)

  • Alignment for Honesty, arXiv, 2312.07000, arxiv, pdf, cication: -1

    Yuqing Yang, Ethan Chern, Xipeng Qiu, Graham Neubig, Pengfei Liu · (alignment-for-honesty - GAIR-NLP) Star

  • The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning, arXiv, 2312.01552, arxiv, pdf, cication: -1

    Bill Yuchen Lin, Abhilasha Ravichander, Ximing Lu, Nouha Dziri, Melanie Sclar, Khyathi Chandu, Chandra Bhagavatula, Yejin Choi · (allenai.github)

    · (jiqizhixin)

    · (URIAL - Re-Align) Star

  • Instruction-tuning Aligns LLMs to the Human Brain, arXiv, 2312.00575, arxiv, pdf, cication: -1

    Khai Loong Aw, Syrielle Montariol, Badr AlKhamissi, Martin Schrimpf, Antoine Bosselut

  • wizardlm - nlpxucan Star

    Family of instruction-following LLMs powered by Evol-Instruct: WizardLM, WizardCoder
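
    Evol-Instruct grows an instruction-tuning corpus by repeatedly asking an LLM to rewrite existing instructions into harder or more constrained variants and filtering out degenerate rewrites. A minimal sketch of that loop follows; the directive prompts and the `complete` callable are illustrative stand-ins, not the WizardLM implementation.

    ```python
    import random
    from typing import Callable, List

    # Illustrative in-depth evolution directives in the spirit of Evol-Instruct;
    # the exact prompts used by WizardLM differ.
    DIRECTIVES = [
        "Add one extra constraint or requirement to the instruction.",
        "Replace general concepts with more specific ones.",
        "Require multi-step reasoning to answer the instruction.",
        "Rewrite the instruction so it needs a concrete input example.",
    ]

    def evolve(seed_instructions: List[str],
               complete: Callable[[str], str],
               rounds: int = 2) -> List[str]:
        """Return the seed pool plus evolved variants produced over `rounds` passes."""
        pool = list(seed_instructions)
        frontier = list(seed_instructions)
        for _ in range(rounds):
            nxt = []
            for inst in frontier:
                directive = random.choice(DIRECTIVES)
                prompt = (f"{directive}\n\nOriginal instruction:\n{inst}\n\n"
                          f"Rewritten instruction:")
                evolved = complete(prompt).strip()
                # Keep only non-trivial rewrites (a cheap proxy for the paper's filtering step).
                if evolved and evolved != inst:
                    nxt.append(evolved)
            pool.extend(nxt)
            frontier = nxt
        return pool

    if __name__ == "__main__":
        # Stub "LLM" so the sketch runs without any API call; swap in a real model here.
        stub = lambda prompt: "Explain what a hash map is and compare its lookup cost with a balanced tree."
        print(evolve(["Explain what a hash map is."], complete=stub, rounds=1))
    ```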

  • Trusted Source Alignment in Large Language Models, arXiv, 2311.06697, arxiv, pdf, cication: -1

    Vasilisa Bashlovkina, Zhaobin Kuang, Riley Matthews, Edward Clifford, Yennie Jun, William W. Cohen, Simon Baumgartner

  • AlignBench: Benchmarking Chinese Alignment of Large Language Models, arXiv, 2311.18743, arxiv, pdf, cication: 8

    Xiao Liu, Xuanyu Lei, Shengyuan Wang, Yue Huang, Zhuoer Feng, Bosi Wen, Jiale Cheng, Pei Ke, Yifan Xu, Weng Lam Tam · (AlignBench - THUDM) Star

  • Zephyr: Direct Distillation of LM Alignment, arXiv, 2310.16944, arxiv, pdf, cication: 1

    Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Clémentine Fourrier, Nathan Habib · (alignment-handbook - huggingface) Star

  • Controlled Decoding from Language Models, arXiv, 2310.17022, arxiv, pdf, cication: -1

    Sidharth Mudgal, Jong Lee, Harish Ganapathy, YaGuang Li, Tao Wang, Yanping Huang, Zhifeng Chen, Heng-Tze Cheng, Michael Collins, Trevor Strohman

  • Auto-Instruct: Automatic Instruction Generation and Ranking for Black-Box Language Models, arXiv, 2310.13127, arxiv, pdf, cication: -1

    Zhihan Zhang, Shuohang Wang, Wenhao Yu, Yichong Xu, Dan Iter, Qingkai Zeng, Yang Liu, Chenguang Zhu, Meng Jiang

  • An Emulator for Fine-Tuning Large Language Models using Small Language Models, arXiv, 2310.12962, arxiv, pdf, cication: -1

    Eric Mitchell, Rafael Rafailov, Archit Sharma, Chelsea Finn, Christopher D. Manning

  • NEFTune: Noisy Embeddings Improve Instruction Finetuning, arXiv, 2310.05914, arxiv, pdf, cication: -1

    Neel Jain, Ping-yeh Chiang, Yuxin Wen, John Kirchenbauer, Hong-Min Chu, Gowthami Somepalli, Brian R. Bartoldson, Bhavya Kailkhura, Avi Schwarzschild, Aniruddha Saha · (qbitai)

  • alignment-handbook - huggingface Star

    Robust recipes to align language models with human and AI preferences
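
    Most of the handbook's recipes (e.g. the Zephyr dDPO pipeline listed above) reduce to optimizing a DPO-style objective over chosen/rejected pairs. A self-contained sketch of the DPO loss, assuming the summed per-response log-probabilities under the policy and the frozen reference model are already computed:

    ```python
    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps: torch.Tensor,
                 policy_rejected_logps: torch.Tensor,
                 ref_chosen_logps: torch.Tensor,
                 ref_rejected_logps: torch.Tensor,
                 beta: float = 0.1) -> torch.Tensor:
        """DPO loss: -log sigmoid(beta * ((logpi_w - logref_w) - (logpi_l - logref_l)))."""
        chosen_ratio = policy_chosen_logps - ref_chosen_logps        # log pi(y_w|x) - log pi_ref(y_w|x)
        rejected_ratio = policy_rejected_logps - ref_rejected_logps  # log pi(y_l|x) - log pi_ref(y_l|x)
        logits = beta * (chosen_ratio - rejected_ratio)
        return -F.logsigmoid(logits).mean()

    # Toy usage with made-up sequence log-probs for a batch of two preference pairs.
    pi_w = torch.tensor([-12.0, -20.0]); pi_l = torch.tensor([-15.0, -19.0])
    ref_w = torch.tensor([-13.0, -21.0]); ref_l = torch.tensor([-14.0, -18.0])
    print(dpo_loss(pi_w, pi_l, ref_w, ref_l))
    ```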

  • Xwin-LM - Xwin-LM Star

    Xwin-LM: Powerful, Stable, and Reproducible LLM Alignment · (mp.weixin.qq)

  • Self-Alignment with Instruction Backtranslation, arXiv, 2308.06259, arxiv, pdf, cication: 13

    Xian Li, Ping Yu, Chunting Zhou, Timo Schick, Luke Zettlemoyer, Omer Levy, Jason Weston, Mike Lewis · (jiqizhixin)

  • Simple synthetic data reduces sycophancy in large language models, arXiv, 2308.03958, arxiv, pdf, cication: 7

    Jerry Wei, Da Huang, Yifeng Lu, Denny Zhou, Quoc V. Le

  • alignllmhumansurvey - garyyufei Star

    Aligning Large Language Models with Human: A Survey

  • RLCD: Reinforcement Learning from Contrast Distillation for Language Model Alignment, arXiv, 2307.12950, arxiv, pdf, cication: 5

    Kevin Yang, Dan Klein, Asli Celikyilmaz, Nanyun Peng, Yuandong Tian

  • AlpaGasus: Training A Better Alpaca with Fewer Data, arXiv, 2307.08701, arxiv, pdf, cication: 11

    Lichang Chen, Shiyang Li, Jun Yan, Hai Wang, Kalpa Gunaratna, Vikas Yadav, Zheng Tang, Vijay Srinivasan, Tianyi Zhou, Heng Huang · (lichang-chen.github)

  • Instruction Mining: When Data Mining Meets Large Language Model Finetuning, arXiv, 2307.06290, arxiv, pdf, cication: 3

    Yihan Cao, Yanbin Kang, Chi Wang, Lichao Sun

  • Becoming self-instruct: introducing early stopping criteria for minimal instruct tuning, arXiv, 2307.03692, arxiv, pdf, cication: 2

    Waseem AlShikh, Manhal Daaboul, Kirk Goddard, Brock Imel, Kiran Kamble, Parikshith Kulkarni, Melisa Russak

  • Training Models to Generate, Recognize, and Reframe Unhelpful Thoughts, arXiv, 2307.02768, arxiv, pdf, cication: 2

    Mounica Maddela, Megan Ung, Jing Xu, Andrea Madotto, Heather Foran, Y-Lan Boureau

  • Goal Representations for Instruction Following: A Semi-Supervised Language Interface to Control, arXiv, 2307.00117, arxiv, pdf, cication: 3

    Vivek Myers, Andre He, Kuan Fang, Homer Walke, Philippe Hansen-Estruch, Ching-An Cheng, Mihai Jalobeanu, Andrey Kolobov, Anca Dragan, Sergey Levine

  • On the Exploitability of Instruction Tuning, arXiv, 2306.17194, arxiv, pdf, cication: 4

    Manli Shu, Jiongxiao Wang, Chen Zhu, Jonas Geiping, Chaowei Xiao, Tom Goldstein

  • Are aligned neural networks adversarially aligned?, arXiv, 2306.15447, arxiv, pdf, cication: 30

    Nicholas Carlini, Milad Nasr, Christopher A. Choquette-Choo, Matthew Jagielski, Irena Gao, Anas Awadalla, Pang Wei Koh, Daphne Ippolito, Katherine Lee, Florian Tramer

  • Constitutional AI: Harmlessness from AI Feedback, arXiv, 2212.08073, arxiv, pdf, cication: 249

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon

  • A General Language Assistant as a Laboratory for Alignment, arXiv, 2112.00861, arxiv, pdf, cication: 61

    Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma

Other

Awesome RLHF

Survey

  • A Survey of Reinforcement Learning from Human Feedback, arXiv, 2312.14925, arxiv, pdf, cication: 5

    Timo Kaufmann, Paul Weng, Viktor Bengs, Eyke Hüllermeier

Papers

  • Understanding Reference Policies in Direct Preference Optimization, arXiv, 2407.13709, arxiv, pdf, cication: -1

    Yixin Liu, Pengfei Liu, Arman Cohan · (refdpo - yale-nlp) Star

  • Conditioned Language Policy: A General Framework for Steerable Multi-Objective Finetuning, arXiv, 2407.15762, arxiv, pdf, cication: -1

    Kaiwen Wang, Rahul Kidambi, Ryan Sullivan, Alekh Agarwal, Christoph Dann, Andrea Michi, Marco Gelmi, Yunxuan Li, Raghav Gupta, Avinava Dubey

  • Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning, arXiv, 2407.00782, arxiv, pdf, cication: -1

    Zimu Lu, Aojun Zhou, Ke Wang, Houxing Ren, Weikang Shi, Junting Pan, Mingjie Zhan, Hongsheng Li

    · (Step-Controlled_DPO - mathllm) Star

  • Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs, arXiv, 2406.18629, arxiv, pdf, cication: -1

    Xin Lai, Zhuotao Tian, Yukang Chen, Senqiao Yang, Xiangru Peng, Jiaya Jia

    · (Step-DPO - dvlab-research) Star

  • WARP: On the Benefits of Weight Averaged Rewarded Policies, arXiv, 2406.16768, arxiv, pdf, cication: -1

    Alexandre Ramé, Johan Ferret, Nino Vieillard, Robert Dadashi, Léonard Hussenot, Pierre-Louis Cedoz, Pier Giuseppe Sessa, Sertan Girgin, Arthur Douillard, Olivier Bachem

  • Bootstrapping Language Models with DPO Implicit Rewards, arXiv, 2406.09760, arxiv, pdf, cication: -1

    Changyu Chen, Zichen Liu, Chao Du, Tianyu Pang, Qian Liu, Arunesh Sinha, Pradeep Varakantham, Min Lin · (dice - sail-sg) Star

  • WPO: Enhancing RLHF with Weighted Preference Optimization, arXiv, 2406.11827, arxiv, pdf, cication: -1

    Wenxuan Zhou, Ravi Agrawal, Shujian Zhang, Sathish Reddy Indurthi, Sanqiang Zhao, Kaiqiang Song, Silei Xu, Chenguang Zhu · (WPO - wzhouad) Star

  • mDPO: Conditional Preference Optimization for Multimodal Large Language Models, arXiv, 2406.11839, arxiv, pdf, cication: -1

    Fei Wang, Wenxuan Zhou, James Y. Huang, Nan Xu, Sheng Zhang, Hoifung Poon, Muhao Chen

  • Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models, arXiv, 2406.10162, arxiv, pdf, cication: -1

    Carson Denison, Monte MacDiarmid, Fazl Barez, David Duvenaud, Shauna Kravec, Samuel Marks, Nicholas Schiefer, Ryan Soklaski, Alex Tamkin, Jared Kaplan

  • Mistral-C2F: Coarse to Fine Actor for Analytical and Reasoning Enhancement in RLHF and Effective-Merged LLMs, arXiv, 2406.08657, arxiv, pdf, cication: -1

    Chen Zheng, Ke Sun, Xun Zhou

  • HelpSteer2: Open-source dataset for training top-performing reward models, arXiv, 2406.08673, arxiv, pdf, cication: -1

    Zhilin Wang, Yi Dong, Olivier Delalleau, Jiaqi Zeng, Gerald Shen, Daniel Egert, Jimmy J. Zhang, Makesh Narsimhan Sreedhar, Oleksii Kuchaiev · (NeMo-Aligner - NVIDIA) Star · (huggingface)

  • Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback, arXiv, 2406.09279, arxiv, pdf, cication: -1

    Hamish Ivison, Yizhong Wang, Jiacheng Liu, Zeqiu Wu, Valentina Pyatkin, Nathan Lambert, Noah A. Smith, Yejin Choi, Hannaneh Hajishirzi · (EasyLM - hamishivi) Star

  • Discovering Preference Optimization Algorithms with and for Large Language Models, arXiv, 2406.08414, arxiv, pdf, cication: -1

    Chris Lu, Samuel Holt, Claudio Fanconi, Alex J. Chan, Jakob Foerster, Mihaela van der Schaar, Robert Tjarko Lange · (DiscoPOP - SakanaAI) Star

  • Self-Exploring Language Models: Active Preference Elicitation for Online Alignment, arXiv, 2405.19332, arxiv, pdf, cication: -1

    Shenao Zhang, Donghan Yu, Hiteshi Sharma, Ziyi Yang, Shuohang Wang, Hany Hassan, Zhaoran Wang · (SELM - shenao-zhang) Star

  • Value-Incentivized Preference Optimization: A Unified Approach to Online and Offline RLHF, arXiv, 2405.19320, arxiv, pdf, cication: -1

    Shicong Cen, Jincheng Mei, Katayoon Goshvadi, Hanjun Dai, Tong Yang, Sherry Yang, Dale Schuurmans, Yuejie Chi, Bo Dai

  • SimPO: Simple Preference Optimization with a Reference-Free Reward, arXiv, 2405.14734, arxiv, pdf, cication: -1

    Yu Meng, Mengzhou Xia, Danqi Chen
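
    SimPO drops the reference model: the implicit reward is the length-normalized average log-probability of a response under the policy, scaled by β, and a target margin γ separates chosen from rejected. A minimal sketch under those definitions (the β and γ values here are placeholders, not the paper's tuned settings):

    ```python
    import torch
    import torch.nn.functional as F

    def simpo_loss(chosen_logps_sum: torch.Tensor, chosen_lengths: torch.Tensor,
                   rejected_logps_sum: torch.Tensor, rejected_lengths: torch.Tensor,
                   beta: float = 2.0, gamma: float = 0.5) -> torch.Tensor:
        """Reference-free loss: -log sigmoid(beta*avg_logp_chosen - beta*avg_logp_rejected - gamma)."""
        r_chosen = beta * chosen_logps_sum / chosen_lengths      # length-normalized implicit reward
        r_rejected = beta * rejected_logps_sum / rejected_lengths
        return -F.logsigmoid(r_chosen - r_rejected - gamma).mean()

    # Toy batch: summed token log-probs and response lengths for two preference pairs.
    loss = simpo_loss(torch.tensor([-30.0, -40.0]), torch.tensor([20.0, 25.0]),
                      torch.tensor([-45.0, -50.0]), torch.tensor([22.0, 24.0]))
    print(loss)
    ```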

  • OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework, arXiv, 2405.11143, arxiv, pdf, cication: -1

    Jian Hu, Xibin Wu, Weixun Wang, Xianyu, Dehao Zhang, Yu Cao · (OpenRLHF - OpenLLMAI) Star

  • RLHF Workflow: From Reward Modeling to Online RLHF, arXiv, 2405.07863, arxiv, pdf, cication: -1

    Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, Tong Zhang · (Online-RLHF - RLHFlow) Star · (RLHF-Reward-Modeling - RLHFlow) Star

    · (huggingface)

  • Self-Play Preference Optimization for Language Model Alignment, arXiv, 2405.00675, arxiv, pdf, cication: -1

    Yue Wu, Zhiqing Sun, Huizhuo Yuan, Kaixuan Ji, Yiming Yang, Quanquan Gu

  • Iterative Reasoning Preference Optimization, arXiv, 2404.19733, arxiv, pdf, cication: -1

    Richard Yuanzhe Pang, Weizhe Yuan, Kyunghyun Cho, He He, Sainbayar Sukhbaatar, Jason Weston

  • Insights into Alignment: Evaluating DPO and its Variants Across Multiple Tasks, arXiv, 2404.14723, arxiv, pdf, cication: -1

    Amir Saeidi, Shivanshu Verma, Chitta Baral

  • From $r$ to $Q^*$: Your Language Model is Secretly a Q-Function, arXiv, 2404.12358, arxiv, pdf, cication: -1

    Rafael Rafailov, Joey Hejna, Ryan Park, Chelsea Finn

  • Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study, arXiv, 2404.10719, arxiv, pdf, cication: -1

    Shusheng Xu, Wei Fu, Jiaxuan Gao, Wenjie Ye, Weilin Liu, Zhiyu Mei, Guangju Wang, Chao Yu, Yi Wu

  • Reuse Your Rewards: Reward Model Transfer for Zero-Shot Cross-Lingual Alignment, arXiv, 2404.12318, arxiv, pdf, cication: -1

    Zhaofeng Wu, Ananth Balashankar, Yoon Kim, Jacob Eisenstein, Ahmad Beirami

  • Dataset Reset Policy Optimization for RLHF, arXiv, 2404.08495, arxiv, pdf, cication: -1

    Jonathan D. Chang, Wenhao Shan, Owen Oertell, Kianté Brantley, Dipendra Misra, Jason D. Lee, Wen Sun · (drpo - Cornell-RL) Star

  • RewardBench: Evaluating Reward Models for Language Modeling, arXiv, 2403.13787, arxiv, pdf, cication: -1

    Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi

    • a benchmark dataset and toolkit designed for the comprehensive evaluation of reward models used in RLHF
  • reward-bench - allenai Star

    RewardBench: the first evaluation tool for reward models. · (huggingface) · (twitter)
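
    At its core, a RewardBench-style evaluation is pairwise accuracy: the reward model gets credit whenever it scores the chosen response above the rejected one. A minimal sketch, with a stand-in `reward_fn` rather than the actual RewardBench tooling:

    ```python
    from typing import Callable, Iterable, Tuple

    def pairwise_accuracy(pairs: Iterable[Tuple[str, str, str]],
                          reward_fn: Callable[[str, str], float]) -> float:
        """Fraction of (prompt, chosen, rejected) triples where the chosen response scores higher."""
        wins = total = 0
        for prompt, chosen, rejected in pairs:
            wins += reward_fn(prompt, chosen) > reward_fn(prompt, rejected)
            total += 1
        return wins / max(total, 1)

    # Toy check with a deliberately length-biased dummy reward model.
    data = [("Explain DPO.", "A detailed, correct answer...", "idk"),
            ("Summarize RLHF.", "Reward model plus policy optimization.", "no")]
    print(pairwise_accuracy(data, reward_fn=lambda p, r: float(len(r))))
    ```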

  • ChatGLM-RLHF: Practices of Aligning Large Language Models with Human Feedback, arXiv, 2404.00934, arxiv, pdf, cication: -1

    Zhenyu Hou, Yilin Niu, Zhengxiao Du, Xiaohan Zhang, Xiao Liu, Aohan Zeng, Qinkai Zheng, Minlie Huang, Hongning Wang, Jie Tang

  • sDPO: Don't Use Your Data All at Once, arXiv, 2403.19270, arxiv, pdf, cication: -1

    Dahyun Kim, Yungi Kim, Wonho Song, Hyeonwoo Kim, Yunsu Kim, Sanghoon Kim, Chanjun Park

  • The N+ Implementation Details of RLHF with PPO: A Case Study on TL;DR Summarization, arXiv, 2403.17031, arxiv, pdf, cication: -1

    Shengyi Huang, Michael Noukhovitch, Arian Hosseini, Kashif Rasul, Weixun Wang, Lewis Tunstall

    · (summarize_from_feedback_details - vwxyzjn) Star · (huggingface) · (twitter)

  • PERL: Parameter Efficient Reinforcement Learning from Human Feedback, arXiv, 2403.10704, arxiv, pdf, cication: -1

    Hakim Sidahmed, Samrat Phatale, Alex Hutcheson, Zhuonan Lin, Zhang Chen, Zac Yu, Jarvis Jin, Roman Komarytsia, Christiane Ahlheim, Yonghao Zhu

    • Parameter-Efficient Reinforcement Learning (PERL) uses Low-Rank Adaptation (LoRA) to train models with Reinforcement Learning from Human Feedback (RLHF), aligning pretrained base LLMs with human preferences efficiently; see the sketch below.
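
    The pattern PERL studies, low-rank adapters on a frozen backbone for the RLHF components, looks roughly like the following with the `peft` library on a scalar reward head; the base checkpoint and LoRA hyperparameters below are illustrative, not the paper's configuration.

    ```python
    from transformers import AutoModelForSequenceClassification, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    base = "meta-llama/Llama-2-7b-hf"  # placeholder backbone, not the model used in the paper
    tokenizer = AutoTokenizer.from_pretrained(base)
    # Scalar reward head: a single regression label on top of the language-model backbone.
    model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=1)

    # Low-rank adapters on the attention projections; the rest of the backbone stays frozen.
    lora_cfg = LoraConfig(
        task_type="SEQ_CLS",
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],
    )
    model = get_peft_model(model, lora_cfg)
    model.print_trainable_parameters()  # typically well under 1% of the backbone's parameters
    ```
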
  • ORPO: Monolithic Preference Optimization without Reference Model, arXiv, 2403.07691, arxiv, pdf, cication: -1

    Jiwoo Hong, Noah Lee, James Thorne · (orpo - xfactlab) Star

  • Teaching Large Language Models to Reason with Reinforcement Learning, arXiv, 2403.04642, arxiv, pdf, cication: -1

    Alex Havrilla, Yuqing Du, Sharath Chandra Raparthy, Christoforos Nalmpantis, Jane Dwivedi-Yu, Maksym Zhuravinskyi, Eric Hambro, Sainbayar Sukhbaatar, Roberta Raileanu

  • Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs, arXiv, 2402.14740, arxiv, pdf, cication: -1

    Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Ahmet Üstün, Sara Hooker

  • Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive, arXiv, 2402.13228, arxiv, pdf, cication: -1

    Arka Pal, Deep Karkhanis, Samuel Dooley, Manley Roberts, Siddartha Naidu, Colin White

  • A Critical Evaluation of AI Feedback for Aligning Large Language Models, arXiv, 2402.12366, arxiv, pdf, cication: -1

    Archit Sharma, Sedrick Keh, Eric Mitchell, Chelsea Finn, Kushal Arora, Thomas Kollar

  • RLVF: Learning from Verbal Feedback without Overgeneralization, arXiv, 2402.10893, arxiv, pdf, cication: -1

    Moritz Stephan, Alexander Khazatsky, Eric Mitchell, Annie S Chen, Sheryl Hsu, Archit Sharma, Chelsea Finn

  • A Minimaximalist Approach to Reinforcement Learning from Human Feedback, arXiv, 2401.04056, arxiv, pdf, cication: 4

    Gokul Swamy, Christoph Dann, Rahul Kidambi, Zhiwei Steven Wu, Alekh Agarwal · (jiqizhixin)

  • Suppressing Pink Elephants with Direct Principle Feedback, arXiv, 2402.07896, arxiv, pdf, cication: -1

    Louis Castricato, Nathan Lile, Suraj Anand, Hailey Schoelkopf, Siddharth Verma, Stella Biderman

  • ODIN: Disentangled Reward Mitigates Hacking in RLHF, arXiv, 2402.07319, arxiv, pdf, cication: -1

    Lichang Chen, Chen Zhu, Davit Soselia, Jiuhai Chen, Tianyi Zhou, Tom Goldstein, Heng Huang, Mohammad Shoeybi, Bryan Catanzaro

  • LiPO: Listwise Preference Optimization through Learning-to-Rank, arXiv, 2402.01878, arxiv, pdf, cication: -1

    Tianqi Liu, Zhen Qin, Junru Wu, Jiaming Shen, Misha Khalman, Rishabh Joshi, Yao Zhao, Mohammad Saleh, Simon Baumgartner, Jialu Liu

  • StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback, arXiv, 2402.01391, arxiv, pdf, cication: -1

    Shihan Dou, Yan Liu, Haoxiang Jia, Limao Xiong, Enyu Zhou, Wei Shen, Junjie Shan, Caishuang Huang, Xiao Wang, Xiaoran Fan

  • Transforming and Combining Rewards for Aligning Large Language Models, arXiv, 2402.00742, arxiv, pdf, cication: -1

    Zihao Wang, Chirag Nagpal, Jonathan Berant, Jacob Eisenstein, Alex D'Amour, Sanmi Koyejo, Victor Veitch

  • Aligning Large Language Models with Counterfactual DPO, arXiv, 2401.09566, arxiv, pdf, cication: -1

    Bradley Butcher

  • WARM: On the Benefits of Weight Averaged Reward Models, arXiv, 2401.12187, arxiv, pdf, cication: -1

    Alexandre Ramé, Nino Vieillard, Léonard Hussenot, Robert Dadashi, Geoffrey Cideron, Olivier Bachem, Johan Ferret

  • A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity, arXiv, 2401.01967, arxiv, pdf, cication: 11

    Andrew Lee, Xiaoyan Bai, Itamar Pres, Martin Wattenberg, Jonathan K. Kummerfeld, Rada Mihalcea

  • ReFT: Reasoning with Reinforced Fine-Tuning, arXiv, 2401.08967, arxiv, pdf, cication: -1

    Trung Quoc Luong, Xinbo Zhang, Zhanming Jie, Peng Sun, Xiaoran Jin, Hang Li

  • Self-Rewarding Language Models, arXiv, 2401.10020, arxiv, pdf, cication: -1

    Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Sainbayar Sukhbaatar, Jing Xu, Jason Weston

  • Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation, arXiv, 2401.08417, arxiv, pdf, cication: -1

    Haoran Xu, Amr Sharaf, Yunmo Chen, Weiting Tan, Lingfeng Shen, Benjamin Van Durme, Kenton Murray, Young Jin Kim

  • Secrets of RLHF in Large Language Models Part II: Reward Modeling, arXiv, 2401.06080, arxiv, pdf, cication: -1

    Binghai Wang, Rui Zheng, Lu Chen, Yan Liu, Shihan Dou, Caishuang Huang, Wei Shen, Senjie Jin, Enyu Zhou, Chenyu Shi

    · (jiqizhixin)

  • ICE-GRT: Instruction Context Enhancement by Generative Reinforcement based Transformers, arXiv, 2401.02072, arxiv, pdf, cication: -1

    Chen Zheng, Ke Sun, Da Tang, Yukun Ma, Yuyu Zhang, Chenguang Xi, Xun Zhou

  • InstructVideo: Instructing Video Diffusion Models with Human Feedback, arXiv, 2312.12490, arxiv, pdf, cication: -1

    Hangjie Yuan, Shiwei Zhang, Xiang Wang, Yujie Wei, Tao Feng, Yining Pan, Yingya Zhang, Ziwei Liu, Samuel Albanie, Dong Ni

  • Silkie: Preference Distillation for Large Visual Language Models, arXiv, 2312.10665, arxiv, pdf, cication: -1

    Lei Li, Zhihui Xie, Mukai Li, Shunian Chen, Peiyi Wang, Liang Chen, Yazheng Yang, Benyou Wang, Lingpeng Kong

  • Align on the Fly: Adapting Chatbot Behavior to Established Norms, arXiv, 2312.15907, arxiv, pdf, cication: -1

    Chunpu Xu, Steffi Chern, Ethan Chern, Ge Zhang, Zekun Wang, Ruibo Liu, Jing Li, Jie Fu, Pengfei Liu · (jiqizhixin) · (OPO - GAIR-NLP) Star · (gair-nlp.github)

  • Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking, arXiv, 2312.09244, arxiv, pdf, cication: -1

    Jacob Eisenstein, Chirag Nagpal, Alekh Agarwal, Ahmad Beirami, Alex D'Amour, DJ Dvijotham, Adam Fisch, Katherine Heller, Stephen Pfohl, Deepak Ramachandran

  • Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models, arXiv, 2312.06585, arxiv, pdf, cication: -1

    Avi Singh, John D. Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Peter J. Liu, James Harrison, Jaehoon Lee, Kelvin Xu, Aaron Parisi

  • HALOs - ContextualAI Star

    Human-Centered Loss Functions (HALOs) · (HALOs - ContextualAI) Star

  • Axiomatic Preference Modeling for Longform Question Answering, arXiv, 2312.02206, arxiv, pdf, cication: -1

    Corby Rosset, Guoqing Zheng, Victor Dibia, Ahmed Awadallah, Paul Bennett · (huggingface)

  • Nash Learning from Human Feedback, arXiv, 2312.00886, arxiv, pdf, cication: -1

    Rémi Munos, Michal Valko, Daniele Calandriello, Mohammad Gheshlaghi Azar, Mark Rowland, Zhaohan Daniel Guo, Yunhao Tang, Matthieu Geist, Thomas Mesnard, Andrea Michi

  • RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback, arXiv, 2312.00849, arxiv, pdf, cication: -1

    Tianyu Yu, Yuan Yao, Haoye Zhang, Taiwen He, Yifeng Han, Ganqu Cui, Jinyi Hu, Zhiyuan Liu, Hai-Tao Zheng, Maosong Sun · (RLHF-V - RLHF-V) Star

  • Starling-7B: Increasing LLM Helpfulness & Harmlessness with RLAIF

  • Adversarial Preference Optimization, arXiv, 2311.08045, arxiv, pdf, cication: -1

    Pengyu Cheng, Yifan Yang, Jian Li, Yong Dai, Nan Du

    · (mp.weixin.qq)

  • Diffusion Model Alignment Using Direct Preference Optimization, arXiv, 2311.12908, arxiv, pdf, cication: -1

    Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, Nikhil Naik

  • Black-Box Prompt Optimization: Aligning Large Language Models without Model Training, arXiv, 2311.04155, arxiv, pdf, cication: -1

    Jiale Cheng, Xiao Liu, Kehan Zheng, Pei Ke, Hongning Wang, Yuxiao Dong, Jie Tang, Minlie Huang · (bpo - thu-coai) Star

  • Towards Understanding Sycophancy in Language Models, arXiv, 2310.13548, arxiv, pdf, cication: -1

    Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston · (jiqizhixin)

  • Contrastive Preference Learning: Learning from Human Feedback without RL, arXiv, 2310.13639, arxiv, pdf, cication: -1

    Joey Hejna, Rafael Rafailov, Harshit Sikchi, Chelsea Finn, Scott Niekum, W. Bradley Knox, Dorsa Sadigh · (jiqizhixin)

  • Don't throw away your value model! Making PPO even better via Value-Guided Monte-Carlo Tree Search decoding, arXiv, 2309.15028, arxiv, pdf, cication: 1

    Jiacheng Liu, Andrew Cohen, Ramakanth Pasunuru, Yejin Choi, Hannaneh Hajishirzi, Asli Celikyilmaz · (jiqizhixin)

  • The N Implementation Details of RLHF with PPO

  • Specific versus General Principles for Constitutional AI, arXiv, 2310.13798, arxiv, pdf, cication: 1

    Sandipan Kundu, Yuntao Bai, Saurav Kadavath, Amanda Askell, Andrew Callahan, Anna Chen, Anna Goldie, Avital Balwit, Azalia Mirhoseini, Brayden McLean

  • A General Theoretical Paradigm to Understand Learning from Human Preferences, arXiv, 2310.12036, arxiv, pdf, cication: 1

    Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, Rémi Munos

  • Tuna: Instruction Tuning using Feedback from Large Language Models, arXiv, 2310.13385, arxiv, pdf, cication: -1

    Haoran Li, Yiran Liu, Xingxing Zhang, Wei Lu, Furu Wei

  • Safe RLHF: Safe Reinforcement Learning from Human Feedback, arXiv, 2310.12773, arxiv, pdf, cication: 1

    Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, Yaodong Yang

  • ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models, arXiv, 2310.10505, arxiv, pdf, cication: -1

    Ziniu Li, Tian Xu, Yushun Zhang, Yang Yu, Ruoyu Sun, Zhi-Quan Luo · (jiqizhixin)

  • Rethinking the Role of PPO in RLHF – The Berkeley Artificial Intelligence Research Blog

  • Reinforcement Learning in the Era of LLMs: What is Essential? What is needed? An RL Perspective on RLHF, Prompting, and Beyond, arXiv, 2310.06147, arxiv, pdf, cication: -1

    Hao Sun

  • A Long Way to Go: Investigating Length Correlations in RLHF, arXiv, 2310.03716, arxiv, pdf, cication: 3

    Prasann Singhal, Tanya Goyal, Jiacheng Xu, Greg Durrett

  • Aligning Large Multimodal Models with Factually Augmented RLHF, arXiv, 2309.14525, arxiv, pdf, cication: 4

    Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang

  • Stabilizing RLHF through Advantage Model and Selective Rehearsal, arXiv, 2309.10202, arxiv, pdf, cication: 1

    Baolin Peng, Linfeng Song, Ye Tian, Lifeng Jin, Haitao Mi, Dong Yu

  • Statistical Rejection Sampling Improves Preference Optimization, arXiv, 2309.06657, arxiv, pdf, cication: -1

    Tianqi Liu, Yao Zhao, Rishabh Joshi, Misha Khalman, Mohammad Saleh, Peter J. Liu, Jialu Liu

  • Efficient RLHF: Reducing the Memory Usage of PPO, arXiv, 2309.00754, arxiv, pdf, cication: 1

    Michael Santacroce, Yadong Lu, Han Yu, Yuanzhi Li, Yelong Shen

  • RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback, arXiv, 2309.00267, arxiv, pdf, cication: 24

    Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor Carbune, Abhinav Rastogi · (mp.weixin.qq)

  • Reinforced Self-Training (ReST) for Language Modeling, arXiv, 2308.08998, arxiv, pdf, cication: 12

    Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu · (jiqizhixin)

  • DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales, arXiv, 2308.01320, arxiv, pdf, cication: 4

    Zhewei Yao, Reza Yazdani Aminabadi, Olatunji Ruwase, Samyam Rajbhandari, Xiaoxia Wu, Ammar Ahmad Awan, Jeff Rasley, Minjia Zhang, Conglong Li, Connor Holmes

  • Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback, arXiv, 2307.15217, arxiv, pdf, cication: 36

    Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire · (jiqizhixin)

  • ICML '23 Tutorial on Reinforcement Learning from Human Feedback

    · (openlmlab.github) · (mp.weixin.qq)

  • Fine-Tuning Language Models with Advantage-Induced Policy Alignment, arXiv, 2306.02231, arxiv, pdf, cication: 5

    Banghua Zhu, Hiteshi Sharma, Felipe Vieira Frujeri, Shi Dong, Chenguang Zhu, Michael I. Jordan, Jiantao Jiao

  • System-Level Natural Language Feedback, arXiv, 2306.13588, arxiv, pdf, cication: 1

    Weizhe Yuan, Kyunghyun Cho, Jason Weston

  • Fine-Grained Human Feedback Gives Better Rewards for Language Model Training, arXiv, 2306.01693, arxiv, pdf, cication: 7

    Zeqiu Wu, Yushi Hu, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj Ammanabrolu, Noah A. Smith, Mari Ostendorf, Hannaneh Hajishirzi · (finegrainedrlhf.github) · (qbitai)

  • Direct Preference Optimization: Your Language Model is Secretly a Reward Model, arXiv, 2305.18290, arxiv, pdf, cication: -1

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn

  • Let's Verify Step by Step, arXiv, 2305.20050, arxiv, pdf, cication: 76

    Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, Karl Cobbe

  • Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback, arXiv, 2204.05862, arxiv, pdf, cication: 109

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan · (hh-rlhf - anthropics) Star

  • Training language models to follow instructions with human feedback, NeurIPS, 2022, arxiv, pdf, cication: 6793

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray

  • Learning to summarize from human feedback, NeurIPS, 2020, arxiv, pdf, cication: 1122

    Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul Christiano

Projects

Other

Extra reference

  • awesome-RLHF - opendilab Star

    A curated list of reinforcement learning with human feedback resources (continually updated)