LLM Alignment

Survey

  • A Comprehensive Survey of LLM Alignment Techniques: RLHF, RLAIF, PPO, DPO and More, arXiv, 2407.16216, arxiv, pdf, cication: -1

    Zhichao Wang, Bin Bi, Shiva Kumar Pentyala, Kiran Ramnath, Sougata Chaudhuri, Shubham Mehrotra, Zixu Zhu, Xiang-Bo Mao, Sitaram Asur

  • Towards Scalable Automated Alignment of LLMs: A Survey, arXiv, 2406.01252, arxiv, pdf, cication: -1

    Boxi Cao, Keming Lu, Xinyu Lu, Jiawei Chen, Mengjie Ren, Hao Xiang, Peilin Liu, Yaojie Lu, Ben He, Xianpei Han

  • Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study, arXiv, 2404.10719, arxiv, pdf, cication: -1

    Shusheng Xu, Wei Fu, Jiaxuan Gao, Wenjie Ye, Weilin Liu, Zhiyu Mei, Guangju Wang, Chao Yu, Yi Wu

  • On the Essence and Prospect: An Investigation of Alignment Approaches for Big Models, arXiv, 2403.04204, arxiv, pdf, cication: -1

    Xinpeng Wang, Shitong Duan, Xiaoyuan Yi, Jing Yao, Shanlin Zhou, Zhihua Wei, Peng Zhang, Dongkuan Xu, Maosong Sun, Xing Xie

  • AI Alignment: A Comprehensive Survey, arXiv, 2310.19852, arxiv, pdf, cication: 1

    Jiaming Ji, Tianyi Qiu, Boyuan Chen, Borong Zhang, Hantao Lou, Kaile Wang, Yawen Duan, Zhonghao He, Jiayi Zhou, Zhaowei Zhang

  • Instruction Tuning for Large Language Models: A Survey, arXiv, 2308.10792, arxiv, pdf, cication: 19

    Shengyu Zhang, Linfeng Dong, Xiaoya Li, Sen Zhang, Xiaofei Sun, Shuhe Wang, Jiwei Li, Runyi Hu, Tianwei Zhang, Fei Wu · (mp.weixin.qq)

  • Large Language Model Alignment: A Survey, arXiv, 2309.15025, arxiv, pdf, cication: 3

    Tianhao Shen, Renren Jin, Yufei Huang, Chuang Liu, Weilong Dong, Zishan Guo, Xinwei Wu, Yan Liu, Deyi Xiong · (jiqizhixin) · (llm-alignment-survey - Magnetic2014) Star

  • Aligning Large Language Models with Human: A Survey, arXiv, 2307.12966, arxiv, pdf, cication: 29

    Yufei Wang, Wanjun Zhong, Liangyou Li, Fei Mi, Xingshan Zeng, Wenyong Huang, Lifeng Shang, Xin Jiang, Qun Liu · (AlignLLMHumanSurvey - GaryYufei) Star

Papers & Projects

  • Course-Correction: Safety Alignment Using Synthetic Preferences, arXiv, 2407.16637, arxiv, pdf, cication: -1

    Rongwu Xu, Yishuo Cai, Zhenhong Zhou, Renjie Gu, Haiqin Weng, Yan Liu, Tianwei Zhang, Wei Xu, Han Qiu

  • Better Alignment with Instruction Back-and-Forth Translation, arXiv, 2408.04614, arxiv, pdf, cication: -1

    Thao Nguyen, Jeffrey Li, Sewoong Oh, Ludwig Schmidt, Jason Weston, Luke Zettlemoyer, Xian Li

  • Phi-3 Safety Post-Training: Aligning Language Models with a "Break-Fix" Cycle, arXiv, 2407.13833, arxiv, pdf, cication: -1

    Emman Haider, Daniel Perez-Becker, Thomas Portet, Piyush Madan, Amit Garg, David Majercak, Wen Wen, Dongwoo Kim, Ziyi Yang, Jianwen Zhang

  • Direct Preference Knowledge Distillation for Large Language Models, arXiv, 2406.19774, arxiv, pdf, cication: -1

    Yixing Li, Yuxian Gu, Li Dong, Dequan Wang, Yu Cheng, Furu Wei

  • On scalable oversight with weak LLMs judging strong LLMs, arXiv, 2407.04622, arxiv, pdf, cication: -1

    Zachary Kenton, Noah Y. Siegel, János Kramár, Jonah Brown-Cohen, Samuel Albanie, Jannis Bulian, Rishabh Agarwal, David Lindner, Yunhao Tang, Noah D. Goodman

  • Creativity Has Left the Chat: The Price of Debiasing Language Models, arXiv, 2406.05587, arxiv, pdf, cication: -1

    Behnam Mohammadi

  • Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms, arXiv, 2406.02900, arxiv, pdf, cication: -1

    Rafael Rafailov, Yaswanth Chittepu, Ryan Park, Harshit Sikchi, Joey Hejna, Bradley Knox, Chelsea Finn, Scott Niekum

  • Self-Improving Robust Preference Optimization, arXiv, 2406.01660, arxiv, pdf, cication: -1

    Eugene Choi, Arash Ahmadian, Matthieu Geist, Olivier Pietquin, Mohammad Gheshlaghi Azar

  • Show, Don't Tell: Aligning Language Models with Demonstrated Feedback, arXiv, 2406.00888, arxiv, pdf, cication: -1

    Omar Shaikh, Michelle Lam, Joey Hejna, Yijia Shao, Michael Bernstein, Diyi Yang

    · (demonstrated-feedback - SALT-NLP) Star

  • Xwin-LM: Strong and Scalable Alignment Practice for LLMs, arXiv, 2405.20335, arxiv, pdf, cication: -1

    Bolin Ni, JingCheng Hu, Yixuan Wei, Houwen Peng, Zheng Zhang, Gaofeng Meng, Han Hu · (Xwin-LM - Xwin-LM) Star

  • Offline Regularised Reinforcement Learning for Large Language Models Alignment, arXiv, 2405.19107, arxiv, pdf, cication: -1

    Pierre Harvey Richemond, Yunhao Tang, Daniel Guo, Daniele Calandriello, Mohammad Gheshlaghi Azar, Rafael Rafailov, Bernardo Avila Pires, Eugene Tarassov, Lucas Spangher, Will Ellsworth

  • FLAME: Factuality-Aware Alignment for Large Language Models, arXiv, 2405.01525, arxiv, pdf, cication: -1

    Sheng-Chieh Lin, Luyu Gao, Barlas Oguz, Wenhan Xiong, Jimmy Lin, Wen-tau Yih, Xilun Chen

  • NeMo-Aligner: Scalable Toolkit for Efficient Model Alignment, arXiv, 2405.01481, arxiv, pdf, cication: -1

    Gerald Shen, Zhilin Wang, Olivier Delalleau, Jiaqi Zeng, Yi Dong, Daniel Egert, Shengyang Sun, Jimmy Zhang, Sahil Jain, Ali Taghibakhshi · (NeMo-Aligner - NVIDIA) Star

  • Simple probes can catch sleeper agents · Anthropic

  • The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions, arXiv, 2404.13208, arxiv, pdf, cication: -1

    Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, Alex Beutel

  • OpenBezoar: Small, Cost-Effective and Open Models Trained on Mixes of Instruction Data, arXiv, 2404.12195, arxiv, pdf, cication: -1

    Chandeepa Dissanayake, Lahiru Lowe, Sachith Gunasekara, Yasiru Ratnayake · (huggingface)

  • Learn Your Reference Model for Real Good Alignment, arXiv, 2404.09656, arxiv, pdf, cication: -1

    Alexey Gorbatovski, Boris Shaposhnikov, Alexey Malakhov, Nikita Surnachev, Yaroslav Aksenov, Ian Maksimov, Nikita Balagansky, Daniil Gavrilov

  • Foundational Challenges in Assuring Alignment and Safety of Large Language Models, arXiv, 2404.09932, arxiv, pdf, cication: -1

    Usman Anwar, Abulhair Saparov, Javier Rando, Daniel Paleka, Miles Turpin, Peter Hase, Ekdeep Singh Lubana, Erik Jenner, Stephen Casper, Oliver Sourbut

  • Verifiable by Design: Aligning Language Models to Quote from Pre-Training Data, arXiv, 2404.03862, arxiv, pdf, cication: -1

    Jingyu Zhang, Marc Marone, Tianjian Li, Benjamin Van Durme, Daniel Khashabi

  • CantTalkAboutThis: Aligning Language Models to Stay on Topic in Dialogues, arXiv, 2404.03820, arxiv, pdf, cication: -1

    Makesh Narsimhan Sreedhar, Traian Rebedea, Shaona Ghosh, Christopher Parisien

  • Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences, arXiv, 2404.03715, arxiv, pdf, cication: -1

    Corby Rosset, Ching-An Cheng, Arindam Mitra, Michael Santacroce, Ahmed Awadallah, Tengyang Xie

  • Alignment Studio: Aligning Large Language Models to Particular Contextual Regulations, arXiv, 2403.09704, arxiv, pdf, cication: -1

    Swapnaja Achintalwar, Ioana Baldini, Djallel Bouneffouf, Joan Byamugisha, Maria Chang, Pierre Dognin, Eitan Farchi, Ndivhuwo Makondo, Aleksandra Mojsilovic, Manish Nagireddy

  • Instruction-tuned Language Models are Better Knowledge Learners, arXiv, 2402.12847, arxiv, pdf, cication: -1

    Zhengbao Jiang, Zhiqing Sun, Weijia Shi, Pedro Rodriguez, Chunting Zhou, Graham Neubig, Xi Victoria Lin, Wen-tau Yih, Srinivasan Iyer

  • Reformatted Alignment, arXiv, 2402.12219, arxiv, pdf, cication: -1

    Run-Ze Fan, Xuefeng Li, Haoyang Zou, Junlong Li, Shwai He, Ethan Chern, Jiewen Hu, Pengfei Liu · (ReAlign - GAIR-NLP) Star · (gair-nlp.github)

    · (qbitai)

  • Aligner: Achieving Efficient Alignment through Weak-to-Strong Correction, arXiv, 2402.02416, arxiv, pdf, cication: -1

    Jiaming Ji, Boyuan Chen, Hantao Lou, Donghai Hong, Borong Zhang, Xuehai Pan, Juntao Dai, Yaodong Yang · (jiqizhixin) · (aligner2024.github)

  • LESS: Selecting Influential Data for Targeted Instruction Tuning, arXiv, 2402.04333, arxiv, pdf, cication: -1

    Mengzhou Xia, Sadhika Malladi, Suchin Gururangan, Sanjeev Arora, Danqi Chen · (less - princeton-nlp) Star

    · (qbitai)

  • Generative Representational Instruction Tuning, arXiv, 2402.09906, arxiv, pdf, cication: -1

    Niklas Muennighoff, Hongjin Su, Liang Wang, Nan Yang, Furu Wei, Tao Yu, Amanpreet Singh, Douwe Kiela

  • DeAL: Decoding-time Alignment for Large Language Models, arXiv, 2402.06147, arxiv, pdf, cication: -1

    James Y. Huang, Sailik Sengupta, Daniele Bonadiman, Yi-an Lai, Arshit Gupta, Nikolaos Pappas, Saab Mansour, Katrin Kirchhoff, Dan Roth

  • Direct Language Model Alignment from Online AI Feedback, arXiv, 2402.04792, arxiv, pdf, cication: -1

    Shangmin Guo, Biao Zhang, Tianlin Liu, Tianqi Liu, Misha Khalman, Felipe Llinares, Alexandre Rame, Thomas Mesnard, Yao Zhao, Bilal Piot

  • Specialized Language Models with Cheap Inference from Limited Domain Data, arXiv, 2402.01093, arxiv, pdf, cication: -1

    David Grangier, Angelos Katharopoulos, Pierre Ablin, Awni Hannun

  • Human-Instruction-Free LLM Self-Alignment with Limited Samples, arXiv, 2401.06785, arxiv, pdf, cication: -1

    Hongyi Guo, Yuanshun Yao, Wei Shen, Jiaheng Wei, Xiaoying Zhang, Zhaoran Wang, Yang Liu

  • WaveCoder: Widespread And Versatile Enhanced Instruction Tuning with Refined Data Generation, arXiv, 2312.14187, arxiv, pdf, cication: -1

    Zhaojian Yu, Xin Zhang, Ning Shang, Yangyu Huang, Can Xu, Yishujie Zhao, Wenxiang Hu, Qiufeng Yin

  • Teach Llamas to Talk: Recent Progress in Instruction Tuning

    · (mp.weixin.qq)

  • weak-to-strong - openai Star

    · (openai) · (cdn.openai) · (jiqizhixin) · (mp.weixin.qq)

  • Alignment for Honesty, arXiv, 2312.07000, arxiv, pdf, cication: -1

    Yuqing Yang, Ethan Chern, Xipeng Qiu, Graham Neubig, Pengfei Liu · (alignment-for-honesty - GAIR-NLP) Star

  • The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning, arXiv, 2312.01552, arxiv, pdf, cication: -1

    Bill Yuchen Lin, Abhilasha Ravichander, Ximing Lu, Nouha Dziri, Melanie Sclar, Khyathi Chandu, Chandra Bhagavatula, Yejin Choi · (allenai.github)

    · (jiqizhixin)

    · (URIAL - Re-Align) Star

  • Instruction-tuning Aligns LLMs to the Human Brain, arXiv, 2312.00575, arxiv, pdf, cication: -1

    Khai Loong Aw, Syrielle Montariol, Badr AlKhamissi, Martin Schrimpf, Antoine Bosselut

  • wizardlm - nlpxucan Star

    Family of instruction-following LLMs powered by Evol-Instruct: WizardLM, WizardCoder
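
    Evol-Instruct grows an instruction-tuning corpus by repeatedly asking an LLM to rewrite existing instructions into harder or more constrained variants and filtering out degenerate rewrites. A minimal sketch of that loop follows; the directive prompts and the `complete` callable are illustrative stand-ins, not the WizardLM implementation.

    ```python
    import random
    from typing import Callable, List

    # Illustrative in-depth evolution directives in the spirit of Evol-Instruct;
    # the exact prompts used by WizardLM differ.
    DIRECTIVES = [
        "Add one extra constraint or requirement to the instruction.",
        "Replace general concepts with more specific ones.",
        "Require multi-step reasoning to answer the instruction.",
        "Rewrite the instruction so it needs a concrete input example.",
    ]

    def evolve(seed_instructions: List[str],
               complete: Callable[[str], str],
               rounds: int = 2) -> List[str]:
        """Return the seed pool plus evolved variants produced over `rounds` passes."""
        pool = list(seed_instructions)
        frontier = list(seed_instructions)
        for _ in range(rounds):
            nxt = []
            for inst in frontier:
                directive = random.choice(DIRECTIVES)
                prompt = (f"{directive}\n\nOriginal instruction:\n{inst}\n\n"
                          f"Rewritten instruction:")
                evolved = complete(prompt).strip()
                # Keep only non-trivial rewrites (a cheap proxy for the paper's filtering step).
                if evolved and evolved != inst:
                    nxt.append(evolved)
            pool.extend(nxt)
            frontier = nxt
        return pool

    if __name__ == "__main__":
        # Stub "LLM" so the sketch runs without any API call; swap in a real model here.
        stub = lambda prompt: "Explain what a hash map is and compare its lookup cost with a balanced tree."
        print(evolve(["Explain what a hash map is."], complete=stub, rounds=1))
    ```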

  • Trusted Source Alignment in Large Language Models, arXiv, 2311.06697, arxiv, pdf, cication: -1

    Vasilisa Bashlovkina, Zhaobin Kuang, Riley Matthews, Edward Clifford, Yennie Jun, William W. Cohen, Simon Baumgartner

  • AlignBench: Benchmarking Chinese Alignment of Large Language Models, arXiv, 2311.18743, arxiv, pdf, cication: 8

    Xiao Liu, Xuanyu Lei, Shengyuan Wang, Yue Huang, Zhuoer Feng, Bosi Wen, Jiale Cheng, Pei Ke, Yifan Xu, Weng Lam Tam · (AlignBench - THUDM) Star

  • Zephyr: Direct Distillation of LM Alignment, arXiv, 2310.16944, arxiv, pdf, cication: 1

    Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Clémentine Fourrier, Nathan Habib · (alignment-handbook - huggingface) Star

  • Controlled Decoding from Language Models, arXiv, 2310.17022, arxiv, pdf, cication: -1

    Sidharth Mudgal, Jong Lee, Harish Ganapathy, YaGuang Li, Tao Wang, Yanping Huang, Zhifeng Chen, Heng-Tze Cheng, Michael Collins, Trevor Strohman

  • Auto-Instruct: Automatic Instruction Generation and Ranking for Black-Box Language Models, arXiv, 2310.13127, arxiv, pdf, cication: -1

    Zhihan Zhang, Shuohang Wang, Wenhao Yu, Yichong Xu, Dan Iter, Qingkai Zeng, Yang Liu, Chenguang Zhu, Meng Jiang

  • An Emulator for Fine-Tuning Large Language Models using Small Language Models, arXiv, 2310.12962, arxiv, pdf, cication: -1

    Eric Mitchell, Rafael Rafailov, Archit Sharma, Chelsea Finn, Christopher D. Manning

  • NEFTune: Noisy Embeddings Improve Instruction Finetuning, arXiv, 2310.05914, arxiv, pdf, cication: -1

    Neel Jain, Ping-yeh Chiang, Yuxin Wen, John Kirchenbauer, Hong-Min Chu, Gowthami Somepalli, Brian R. Bartoldson, Bhavya Kailkhura, Avi Schwarzschild, Aniruddha Saha · (qbitai)

  • alignment-handbook - huggingface Star

    Robust recipes to align language models with human and AI preferences
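
    Most of the handbook's recipes (e.g. the Zephyr dDPO pipeline listed above) reduce to optimizing a DPO-style objective over chosen/rejected pairs. A self-contained sketch of the DPO loss, assuming the summed per-response log-probabilities under the policy and the frozen reference model are already computed:

    ```python
    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps: torch.Tensor,
                 policy_rejected_logps: torch.Tensor,
                 ref_chosen_logps: torch.Tensor,
                 ref_rejected_logps: torch.Tensor,
                 beta: float = 0.1) -> torch.Tensor:
        """DPO loss: -log sigmoid(beta * ((logpi_w - logref_w) - (logpi_l - logref_l)))."""
        chosen_ratio = policy_chosen_logps - ref_chosen_logps        # log pi(y_w|x) - log pi_ref(y_w|x)
        rejected_ratio = policy_rejected_logps - ref_rejected_logps  # log pi(y_l|x) - log pi_ref(y_l|x)
        logits = beta * (chosen_ratio - rejected_ratio)
        return -F.logsigmoid(logits).mean()

    # Toy usage with made-up sequence log-probs for a batch of two preference pairs.
    pi_w = torch.tensor([-12.0, -20.0]); pi_l = torch.tensor([-15.0, -19.0])
    ref_w = torch.tensor([-13.0, -21.0]); ref_l = torch.tensor([-14.0, -18.0])
    print(dpo_loss(pi_w, pi_l, ref_w, ref_l))
    ```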

  • Xwin-LM - Xwin-LM Star

    Xwin-LM: Powerful, Stable, and Reproducible LLM Alignment · (mp.weixin.qq)

  • Self-Alignment with Instruction Backtranslation, arXiv, 2308.06259, arxiv, pdf, cication: 13

    Xian Li, Ping Yu, Chunting Zhou, Timo Schick, Luke Zettlemoyer, Omer Levy, Jason Weston, Mike Lewis · (jiqizhixin)

  • Simple synthetic data reduces sycophancy in large language models, arXiv, 2308.03958, arxiv, pdf, cication: 7

    Jerry Wei, Da Huang, Yifeng Lu, Denny Zhou, Quoc V. Le

  • alignllmhumansurvey - garyyufei Star

    Aligning Large Language Models with Human: A Survey

  • RLCD: Reinforcement Learning from Contrast Distillation for Language Model Alignment, arXiv, 2307.12950, arxiv, pdf, cication: 5

    Kevin Yang, Dan Klein, Asli Celikyilmaz, Nanyun Peng, Yuandong Tian

  • AlpaGasus: Training A Better Alpaca with Fewer Data, arXiv, 2307.08701, arxiv, pdf, cication: 11

    Lichang Chen, Shiyang Li, Jun Yan, Hai Wang, Kalpa Gunaratna, Vikas Yadav, Zheng Tang, Vijay Srinivasan, Tianyi Zhou, Heng Huang · (lichang-chen.github)

  • Instruction Mining: When Data Mining Meets Large Language Model Finetuning, arXiv, 2307.06290, arxiv, pdf, cication: 3

    Yihan Cao, Yanbin Kang, Chi Wang, Lichao Sun

  • Becoming self-instruct: introducing early stopping criteria for minimal instruct tuning, arXiv, 2307.03692, arxiv, pdf, cication: 2

    Waseem AlShikh, Manhal Daaboul, Kirk Goddard, Brock Imel, Kiran Kamble, Parikshith Kulkarni, Melisa Russak

  • Training Models to Generate, Recognize, and Reframe Unhelpful Thoughts, arXiv, 2307.02768, arxiv, pdf, cication: 2

    Mounica Maddela, Megan Ung, Jing Xu, Andrea Madotto, Heather Foran, Y-Lan Boureau

  • Goal Representations for Instruction Following: A Semi-Supervised Language Interface to Control, arXiv, 2307.00117, arxiv, pdf, cication: 3

    Vivek Myers, Andre He, Kuan Fang, Homer Walke, Philippe Hansen-Estruch, Ching-An Cheng, Mihai Jalobeanu, Andrey Kolobov, Anca Dragan, Sergey Levine

  • On the Exploitability of Instruction Tuning, arXiv, 2306.17194, arxiv, pdf, cication: 4

    Manli Shu, Jiongxiao Wang, Chen Zhu, Jonas Geiping, Chaowei Xiao, Tom Goldstein

  • Are aligned neural networks adversarially aligned?, arXiv, 2306.15447, arxiv, pdf, cication: 30

    Nicholas Carlini, Milad Nasr, Christopher A. Choquette-Choo, Matthew Jagielski, Irena Gao, Anas Awadalla, Pang Wei Koh, Daphne Ippolito, Katherine Lee, Florian Tramer

  • Constitutional AI: Harmlessness from AI Feedback, arXiv, 2212.08073, arxiv, pdf, cication: 249

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon

  • A General Language Assistant as a Laboratory for Alignment, arXiv, 2112.00861, arxiv, pdf, cication: 61

    Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma

Other

Awesome RLHF

Survey

  • A Survey of Reinforcement Learning from Human Feedback, arXiv, 2312.14925, arxiv, pdf, cication: 5

    Timo Kaufmann, Paul Weng, Viktor Bengs, Eyke Hüllermeier

Papers

  • Understanding Reference Policies in Direct Preference Optimization, arXiv, 2407.13709, arxiv, pdf, cication: -1

    Yixin Liu, Pengfei Liu, Arman Cohan · (refdpo - yale-nlp) Star

  • Conditioned Language Policy: A General Framework for Steerable Multi-Objective Finetuning, arXiv, 2407.15762, arxiv, pdf, cication: -1

    Kaiwen Wang, Rahul Kidambi, Ryan Sullivan, Alekh Agarwal, Christoph Dann, Andrea Michi, Marco Gelmi, Yunxuan Li, Raghav Gupta, Avinava Dubey

  • Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning, arXiv, 2407.00782, arxiv, pdf, cication: -1

    Zimu Lu, Aojun Zhou, Ke Wang, Houxing Ren, Weikang Shi, Junting Pan, Mingjie Zhan, Hongsheng Li

    · (Step-Controlled_DPO - mathllm) Star

  • Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs, arXiv, 2406.18629, arxiv, pdf, cication: -1

    Xin Lai, Zhuotao Tian, Yukang Chen, Senqiao Yang, Xiangru Peng, Jiaya Jia

    · (Step-DPO - dvlab-research) Star

  • WARP: On the Benefits of Weight Averaged Rewarded Policies, arXiv, 2406.16768, arxiv, pdf, cication: -1

    Alexandre Ramé, Johan Ferret, Nino Vieillard, Robert Dadashi, Léonard Hussenot, Pierre-Louis Cedoz, Pier Giuseppe Sessa, Sertan Girgin, Arthur Douillard, Olivier Bachem

  • Bootstrapping Language Models with DPO Implicit Rewards, arXiv, 2406.09760, arxiv, pdf, cication: -1

    Changyu Chen, Zichen Liu, Chao Du, Tianyu Pang, Qian Liu, Arunesh Sinha, Pradeep Varakantham, Min Lin · (dice - sail-sg) Star

  • WPO: Enhancing RLHF with Weighted Preference Optimization, arXiv, 2406.11827, arxiv, pdf, cication: -1

    Wenxuan Zhou, Ravi Agrawal, Shujian Zhang, Sathish Reddy Indurthi, Sanqiang Zhao, Kaiqiang Song, Silei Xu, Chenguang Zhu · (WPO - wzhouad) Star

  • mDPO: Conditional Preference Optimization for Multimodal Large Language Models, arXiv, 2406.11839, arxiv, pdf, cication: -1

    Fei Wang, Wenxuan Zhou, James Y. Huang, Nan Xu, Sheng Zhang, Hoifung Poon, Muhao Chen

  • Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models, arXiv, 2406.10162, arxiv, pdf, cication: -1

    Carson Denison, Monte MacDiarmid, Fazl Barez, David Duvenaud, Shauna Kravec, Samuel Marks, Nicholas Schiefer, Ryan Soklaski, Alex Tamkin, Jared Kaplan

  • Mistral-C2F: Coarse to Fine Actor for Analytical and Reasoning Enhancement in RLHF and Effective-Merged LLMs, arXiv, 2406.08657, arxiv, pdf, cication: -1

    Chen Zheng, Ke Sun, Xun Zhou

  • HelpSteer2: Open-source dataset for training top-performing reward models, arXiv, 2406.08673, arxiv, pdf, cication: -1

    Zhilin Wang, Yi Dong, Olivier Delalleau, Jiaqi Zeng, Gerald Shen, Daniel Egert, Jimmy J. Zhang, Makesh Narsimhan Sreedhar, Oleksii Kuchaiev · (NeMo-Aligner - NVIDIA) Star · (huggingface)

  • Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback, arXiv, 2406.09279, arxiv, pdf, cication: -1

    Hamish Ivison, Yizhong Wang, Jiacheng Liu, Zeqiu Wu, Valentina Pyatkin, Nathan Lambert, Noah A. Smith, Yejin Choi, Hannaneh Hajishirzi · (EasyLM - hamishivi) Star

  • Discovering Preference Optimization Algorithms with and for Large Language Models, arXiv, 2406.08414, arxiv, pdf, cication: -1

    Chris Lu, Samuel Holt, Claudio Fanconi, Alex J. Chan, Jakob Foerster, Mihaela van der Schaar, Robert Tjarko Lange · (DiscoPOP - SakanaAI) Star

  • Self-Exploring Language Models: Active Preference Elicitation for Online Alignment, arXiv, 2405.19332, arxiv, pdf, cication: -1

    Shenao Zhang, Donghan Yu, Hiteshi Sharma, Ziyi Yang, Shuohang Wang, Hany Hassan, Zhaoran Wang · (SELM - shenao-zhang) Star

  • Value-Incentivized Preference Optimization: A Unified Approach to Online and Offline RLHF, arXiv, 2405.19320, arxiv, pdf, cication: -1

    Shicong Cen, Jincheng Mei, Katayoon Goshvadi, Hanjun Dai, Tong Yang, Sherry Yang, Dale Schuurmans, Yuejie Chi, Bo Dai

  • SimPO: Simple Preference Optimization with a Reference-Free Reward, arXiv, 2405.14734, arxiv, pdf, cication: -1

    Yu Meng, Mengzhou Xia, Danqi Chen
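
    SimPO drops the reference model: the implicit reward is the length-normalized average log-probability of a response under the policy, scaled by β, and a target margin γ separates chosen from rejected. A minimal sketch under those definitions (the β and γ values here are placeholders, not the paper's tuned settings):

    ```python
    import torch
    import torch.nn.functional as F

    def simpo_loss(chosen_logps_sum: torch.Tensor, chosen_lengths: torch.Tensor,
                   rejected_logps_sum: torch.Tensor, rejected_lengths: torch.Tensor,
                   beta: float = 2.0, gamma: float = 0.5) -> torch.Tensor:
        """Reference-free loss: -log sigmoid(beta*avg_logp_chosen - beta*avg_logp_rejected - gamma)."""
        r_chosen = beta * chosen_logps_sum / chosen_lengths      # length-normalized implicit reward
        r_rejected = beta * rejected_logps_sum / rejected_lengths
        return -F.logsigmoid(r_chosen - r_rejected - gamma).mean()

    # Toy batch: summed token log-probs and response lengths for two preference pairs.
    loss = simpo_loss(torch.tensor([-30.0, -40.0]), torch.tensor([20.0, 25.0]),
                      torch.tensor([-45.0, -50.0]), torch.tensor([22.0, 24.0]))
    print(loss)
    ```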

  • OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework, arXiv, 2405.11143, arxiv, pdf, cication: -1

    Jian Hu, Xibin Wu, Weixun Wang, Xianyu, Dehao Zhang, Yu Cao · (OpenRLHF - OpenLLMAI) Star

  • RLHF Workflow: From Reward Modeling to Online RLHF, arXiv, 2405.07863, arxiv, pdf, cication: -1

    Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, Tong Zhang · (Online-RLHF - RLHFlow) Star · (RLHF-Reward-Modeling - RLHFlow) Star

    · (huggingface)

  • Self-Play Preference Optimization for Language Model Alignment, arXiv, 2405.00675, arxiv, pdf, cication: -1

    Yue Wu, Zhiqing Sun, Huizhuo Yuan, Kaixuan Ji, Yiming Yang, Quanquan Gu

  • Iterative Reasoning Preference Optimization, arXiv, 2404.19733, arxiv, pdf, cication: -1

    Richard Yuanzhe Pang, Weizhe Yuan, Kyunghyun Cho, He He, Sainbayar Sukhbaatar, Jason Weston

  • Insights into Alignment: Evaluating DPO and its Variants Across Multiple Tasks, arXiv, 2404.14723, arxiv, pdf, cication: -1

    Amir Saeidi, Shivanshu Verma, Chitta Baral

  • From $r$ to $Q^*$: Your Language Model is Secretly a Q-Function, arXiv, 2404.12358, arxiv, pdf, cication: -1

    Rafael Rafailov, Joey Hejna, Ryan Park, Chelsea Finn

  • Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study, arXiv, 2404.10719, arxiv, pdf, cication: -1

    Shusheng Xu, Wei Fu, Jiaxuan Gao, Wenjie Ye, Weilin Liu, Zhiyu Mei, Guangju Wang, Chao Yu, Yi Wu

  • Reuse Your Rewards: Reward Model Transfer for Zero-Shot Cross-Lingual Alignment, arXiv, 2404.12318, arxiv, pdf, cication: -1

    Zhaofeng Wu, Ananth Balashankar, Yoon Kim, Jacob Eisenstein, Ahmad Beirami

  • Dataset Reset Policy Optimization for RLHF, arXiv, 2404.08495, arxiv, pdf, cication: -1

    Jonathan D. Chang, Wenhao Shan, Owen Oertell, Kianté Brantley, Dipendra Misra, Jason D. Lee, Wen Sun · (drpo - Cornell-RL) Star

  • RewardBench: Evaluating Reward Models for Language Modeling, arXiv, 2403.13787, arxiv, pdf, cication: -1

    Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi

    • a benchmark dataset and toolkit designed for the comprehensive evaluation of reward models used in RLHF
  • reward-bench - allenai Star

    RewardBench: the first evaluation tool for reward models. · (huggingface) · (twitter)
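
    At its core, a RewardBench-style evaluation is pairwise accuracy: the reward model gets credit whenever it scores the chosen response above the rejected one. A minimal sketch, with a stand-in `reward_fn` rather than the actual RewardBench tooling:

    ```python
    from typing import Callable, Iterable, Tuple

    def pairwise_accuracy(pairs: Iterable[Tuple[str, str, str]],
                          reward_fn: Callable[[str, str], float]) -> float:
        """Fraction of (prompt, chosen, rejected) triples where the chosen response scores higher."""
        wins = total = 0
        for prompt, chosen, rejected in pairs:
            wins += reward_fn(prompt, chosen) > reward_fn(prompt, rejected)
            total += 1
        return wins / max(total, 1)

    # Toy check with a deliberately length-biased dummy reward model.
    data = [("Explain DPO.", "A detailed, correct answer...", "idk"),
            ("Summarize RLHF.", "Reward model plus policy optimization.", "no")]
    print(pairwise_accuracy(data, reward_fn=lambda p, r: float(len(r))))
    ```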

  • ChatGLM-RLHF: Practices of Aligning Large Language Models with Human Feedback, arXiv, 2404.00934, arxiv, pdf, cication: -1

    Zhenyu Hou, Yilin Niu, Zhengxiao Du, Xiaohan Zhang, Xiao Liu, Aohan Zeng, Qinkai Zheng, Minlie Huang, Hongning Wang, Jie Tang

  • sDPO: Don't Use Your Data All at Once, arXiv, 2403.19270, arxiv, pdf, cication: -1

    Dahyun Kim, Yungi Kim, Wonho Song, Hyeonwoo Kim, Yunsu Kim, Sanghoon Kim, Chanjun Park

  • The N+ Implementation Details of RLHF with PPO: A Case Study on TL;DR Summarization, arXiv, 2403.17031, arxiv, pdf, cication: -1

    Shengyi Huang, Michael Noukhovitch, Arian Hosseini, Kashif Rasul, Weixun Wang, Lewis Tunstall

    · (summarize_from_feedback_details - vwxyzjn) Star · (huggingface) · (twitter)

  • PERL: Parameter Efficient Reinforcement Learning from Human Feedback, arXiv, 2403.10704, arxiv, pdf, cication: -1

    Hakim Sidahmed, Samrat Phatale, Alex Hutcheson, Zhuonan Lin, Zhang Chen, Zac Yu, Jarvis Jin, Roman Komarytsia, Christiane Ahlheim, Yonghao Zhu

    • Parameter-Efficient Reinforcement Learning (PERL) uses Low-Rank Adaptation (LoRA) to train models with Reinforcement Learning from Human Feedback (RLHF), aligning pretrained base LLMs with human preferences efficiently; see the sketch below.
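
    The pattern PERL studies, low-rank adapters on a frozen backbone for the RLHF components, looks roughly like the following with the `peft` library on a scalar reward head; the base checkpoint and LoRA hyperparameters below are illustrative, not the paper's configuration.

    ```python
    from transformers import AutoModelForSequenceClassification, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    base = "meta-llama/Llama-2-7b-hf"  # placeholder backbone, not the model used in the paper
    tokenizer = AutoTokenizer.from_pretrained(base)
    # Scalar reward head: a single regression label on top of the language-model backbone.
    model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=1)

    # Low-rank adapters on the attention projections; the rest of the backbone stays frozen.
    lora_cfg = LoraConfig(
        task_type="SEQ_CLS",
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],
    )
    model = get_peft_model(model, lora_cfg)
    model.print_trainable_parameters()  # typically well under 1% of the backbone's parameters
    ```
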
  • ORPO: Monolithic Preference Optimization without Reference Model, arXiv, 2403.07691, arxiv, pdf, cication: -1

    Jiwoo Hong, Noah Lee, James Thorne · (orpo - xfactlab) Star

  • Teaching Large Language Models to Reason with Reinforcement Learning, arXiv, 2403.04642, arxiv, pdf, cication: -1

    Alex Havrilla, Yuqing Du, Sharath Chandra Raparthy, Christoforos Nalmpantis, Jane Dwivedi-Yu, Maksym Zhuravinskyi, Eric Hambro, Sainbayar Sukhbaatar, Roberta Raileanu

  • Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs, arXiv, 2402.14740, arxiv, pdf, cication: -1

    Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Ahmet Üstün, Sara Hooker

  • Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive, arXiv, 2402.13228, arxiv, pdf, cication: -1

    Arka Pal, Deep Karkhanis, Samuel Dooley, Manley Roberts, Siddartha Naidu, Colin White

  • A Critical Evaluation of AI Feedback for Aligning Large Language Models, arXiv, 2402.12366, arxiv, pdf, cication: -1

    Archit Sharma, Sedrick Keh, Eric Mitchell, Chelsea Finn, Kushal Arora, Thomas Kollar

  • RLVF: Learning from Verbal Feedback without Overgeneralization, arXiv, 2402.10893, arxiv, pdf, cication: -1

    Moritz Stephan, Alexander Khazatsky, Eric Mitchell, Annie S Chen, Sheryl Hsu, Archit Sharma, Chelsea Finn

  • A Minimaximalist Approach to Reinforcement Learning from Human Feedback, arXiv, 2401.04056, arxiv, pdf, cication: 4

    Gokul Swamy, Christoph Dann, Rahul Kidambi, Zhiwei Steven Wu, Alekh Agarwal · (jiqizhixin)

  • Suppressing Pink Elephants with Direct Principle Feedback, arXiv, 2402.07896, arxiv, pdf, cication: -1

    Louis Castricato, Nathan Lile, Suraj Anand, Hailey Schoelkopf, Siddharth Verma, Stella Biderman

  • ODIN: Disentangled Reward Mitigates Hacking in RLHF, arXiv, 2402.07319, arxiv, pdf, cication: -1

    Lichang Chen, Chen Zhu, Davit Soselia, Jiuhai Chen, Tianyi Zhou, Tom Goldstein, Heng Huang, Mohammad Shoeybi, Bryan Catanzaro

  • LiPO: Listwise Preference Optimization through Learning-to-Rank, arXiv, 2402.01878, arxiv, pdf, cication: -1

    Tianqi Liu, Zhen Qin, Junru Wu, Jiaming Shen, Misha Khalman, Rishabh Joshi, Yao Zhao, Mohammad Saleh, Simon Baumgartner, Jialu Liu

  • StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback, arXiv, 2402.01391, arxiv, pdf, cication: -1

    Shihan Dou, Yan Liu, Haoxiang Jia, Limao Xiong, Enyu Zhou, Wei Shen, Junjie Shan, Caishuang Huang, Xiao Wang, Xiaoran Fan

  • Transforming and Combining Rewards for Aligning Large Language Models, arXiv, 2402.00742, arxiv, pdf, cication: -1

    Zihao Wang, Chirag Nagpal, Jonathan Berant, Jacob Eisenstein, Alex D'Amour, Sanmi Koyejo, Victor Veitch

  • Aligning Large Language Models with Counterfactual DPO, arXiv, 2401.09566, arxiv, pdf, cication: -1

    Bradley Butcher

  • WARM: On the Benefits of Weight Averaged Reward Models, arXiv, 2401.12187, arxiv, pdf, cication: -1

    Alexandre Ramé, Nino Vieillard, Léonard Hussenot, Robert Dadashi, Geoffrey Cideron, Olivier Bachem, Johan Ferret

  • A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity, arXiv, 2401.01967, arxiv, pdf, cication: 11

    Andrew Lee, Xiaoyan Bai, Itamar Pres, Martin Wattenberg, Jonathan K. Kummerfeld, Rada Mihalcea

  • ReFT: Reasoning with Reinforced Fine-Tuning, arXiv, 2401.08967, arxiv, pdf, cication: -1

    Trung Quoc Luong, Xinbo Zhang, Zhanming Jie, Peng Sun, Xiaoran Jin, Hang Li

  • Self-Rewarding Language Models, arXiv, 2401.10020, arxiv, pdf, cication: -1

    Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Sainbayar Sukhbaatar, Jing Xu, Jason Weston

  • Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation, arXiv, 2401.08417, arxiv, pdf, cication: -1

    Haoran Xu, Amr Sharaf, Yunmo Chen, Weiting Tan, Lingfeng Shen, Benjamin Van Durme, Kenton Murray, Young Jin Kim

  • Secrets of RLHF in Large Language Models Part II: Reward Modeling, arXiv, 2401.06080, arxiv, pdf, cication: -1

    Binghai Wang, Rui Zheng, Lu Chen, Yan Liu, Shihan Dou, Caishuang Huang, Wei Shen, Senjie Jin, Enyu Zhou, Chenyu Shi

    · (jiqizhixin)

  • ICE-GRT: Instruction Context Enhancement by Generative Reinforcement based Transformers, arXiv, 2401.02072, arxiv, pdf, cication: -1

    Chen Zheng, Ke Sun, Da Tang, Yukun Ma, Yuyu Zhang, Chenguang Xi, Xun Zhou

  • InstructVideo: Instructing Video Diffusion Models with Human Feedback, arXiv, 2312.12490, arxiv, pdf, cication: -1

    Hangjie Yuan, Shiwei Zhang, Xiang Wang, Yujie Wei, Tao Feng, Yining Pan, Yingya Zhang, Ziwei Liu, Samuel Albanie, Dong Ni

  • Silkie: Preference Distillation for Large Visual Language Models, arXiv, 2312.10665, arxiv, pdf, cication: -1

    Lei Li, Zhihui Xie, Mukai Li, Shunian Chen, Peiyi Wang, Liang Chen, Yazheng Yang, Benyou Wang, Lingpeng Kong

  • Align on the Fly: Adapting Chatbot Behavior to Established Norms, arXiv, 2312.15907, arxiv, pdf, cication: -1

    Chunpu Xu, Steffi Chern, Ethan Chern, Ge Zhang, Zekun Wang, Ruibo Liu, Jing Li, Jie Fu, Pengfei Liu · (jiqizhixin) · (OPO - GAIR-NLP) Star · (gair-nlp.github)

  • Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking, arXiv, 2312.09244, arxiv, pdf, cication: -1

    Jacob Eisenstein, Chirag Nagpal, Alekh Agarwal, Ahmad Beirami, Alex D'Amour, DJ Dvijotham, Adam Fisch, Katherine Heller, Stephen Pfohl, Deepak Ramachandran

  • Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models, arXiv, 2312.06585, arxiv, pdf, cication: -1

    Avi Singh, John D. Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Peter J. Liu, James Harrison, Jaehoon Lee, Kelvin Xu, Aaron Parisi

  • HALOs - ContextualAI Star

    Human-Centered Loss Functions (HALOs) · (HALOs - ContextualAI) Star

  • Axiomatic Preference Modeling for Longform Question Answering, arXiv, 2312.02206, arxiv, pdf, cication: -1

    Corby Rosset, Guoqing Zheng, Victor Dibia, Ahmed Awadallah, Paul Bennett · (huggingface)

  • Nash Learning from Human Feedback, arXiv, 2312.00886, arxiv, pdf, cication: -1

    Rémi Munos, Michal Valko, Daniele Calandriello, Mohammad Gheshlaghi Azar, Mark Rowland, Zhaohan Daniel Guo, Yunhao Tang, Matthieu Geist, Thomas Mesnard, Andrea Michi

  • RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback, arXiv, 2312.00849, arxiv, pdf, cication: -1

    Tianyu Yu, Yuan Yao, Haoye Zhang, Taiwen He, Yifeng Han, Ganqu Cui, Jinyi Hu, Zhiyuan Liu, Hai-Tao Zheng, Maosong Sun · (RLHF-V - RLHF-V) Star

  • Starling-7B: Increasing LLM Helpfulness & Harmlessness with RLAIF

  • Adversarial Preference Optimization, arXiv, 2311.08045, arxiv, pdf, cication: -1

    Pengyu Cheng, Yifan Yang, Jian Li, Yong Dai, Nan Du

    · (mp.weixin.qq)

  • Diffusion Model Alignment Using Direct Preference Optimization, arXiv, 2311.12908, arxiv, pdf, cication: -1

    Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, Nikhil Naik

  • Black-Box Prompt Optimization: Aligning Large Language Models without Model Training, arXiv, 2311.04155, arxiv, pdf, cication: -1

    Jiale Cheng, Xiao Liu, Kehan Zheng, Pei Ke, Hongning Wang, Yuxiao Dong, Jie Tang, Minlie Huang · (bpo - thu-coai) Star

  • Towards Understanding Sycophancy in Language Models, arXiv, 2310.13548, arxiv, pdf, cication: -1

    Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston · (jiqizhixin)

  • Contrastive Preference Learning: Learning from Human Feedback without RL, arXiv, 2310.13639, arxiv, pdf, cication: -1

    Joey Hejna, Rafael Rafailov, Harshit Sikchi, Chelsea Finn, Scott Niekum, W. Bradley Knox, Dorsa Sadigh · (jiqizhixin)

  • Don't throw away your value model! Making PPO even better via Value-Guided Monte-Carlo Tree Search decoding, arXiv, 2309.15028, arxiv, pdf, cication: 1

    Jiacheng Liu, Andrew Cohen, Ramakanth Pasunuru, Yejin Choi, Hannaneh Hajishirzi, Asli Celikyilmaz · (jiqizhixin)

  • The N Implementation Details of RLHF with PPO

  • Specific versus General Principles for Constitutional AI, arXiv, 2310.13798, arxiv, pdf, cication: 1

    Sandipan Kundu, Yuntao Bai, Saurav Kadavath, Amanda Askell, Andrew Callahan, Anna Chen, Anna Goldie, Avital Balwit, Azalia Mirhoseini, Brayden McLean

  • A General Theoretical Paradigm to Understand Learning from Human Preferences, arXiv, 2310.12036, arxiv, pdf, cication: 1

    Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, Rémi Munos

  • Tuna: Instruction Tuning using Feedback from Large Language Models, arXiv, 2310.13385, arxiv, pdf, cication: -1

    Haoran Li, Yiran Liu, Xingxing Zhang, Wei Lu, Furu Wei

  • Safe RLHF: Safe Reinforcement Learning from Human Feedback, arXiv, 2310.12773, arxiv, pdf, cication: 1

    Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, Yaodong Yang

  • ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models, arXiv, 2310.10505, arxiv, pdf, cication: -1

    Ziniu Li, Tian Xu, Yushun Zhang, Yang Yu, Ruoyu Sun, Zhi-Quan Luo · (jiqizhixin)

  • Rethinking the Role of PPO in RLHF – The Berkeley Artificial Intelligence Research Blog

  • Reinforcement Learning in the Era of LLMs: What is Essential? What is needed? An RL Perspective on RLHF, Prompting, and Beyond, arXiv, 2310.06147, arxiv, pdf, cication: -1

    Hao Sun

  • A Long Way to Go: Investigating Length Correlations in RLHF, arXiv, 2310.03716, arxiv, pdf, cication: 3

    Prasann Singhal, Tanya Goyal, Jiacheng Xu, Greg Durrett

  • Aligning Large Multimodal Models with Factually Augmented RLHF, arXiv, 2309.14525, arxiv, pdf, cication: 4

    Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang

  • Stabilizing RLHF through Advantage Model and Selective Rehearsal, arXiv, 2309.10202, arxiv, pdf, cication: 1

    Baolin Peng, Linfeng Song, Ye Tian, Lifeng Jin, Haitao Mi, Dong Yu

  • Statistical Rejection Sampling Improves Preference Optimization, arXiv, 2309.06657, arxiv, pdf, cication: -1

    Tianqi Liu, Yao Zhao, Rishabh Joshi, Misha Khalman, Mohammad Saleh, Peter J. Liu, Jialu Liu

  • Efficient RLHF: Reducing the Memory Usage of PPO, arXiv, 2309.00754, arxiv, pdf, cication: 1

    Michael Santacroce, Yadong Lu, Han Yu, Yuanzhi Li, Yelong Shen

  • RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback, arXiv, 2309.00267, arxiv, pdf, cication: 24

    Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor Carbune, Abhinav Rastogi · (mp.weixin.qq)

  • Reinforced Self-Training (ReST) for Language Modeling, arXiv, 2308.08998, arxiv, pdf, cication: 12

    Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu · (jiqizhixin)

  • DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales, arXiv, 2308.01320, arxiv, pdf, cication: 4

    Zhewei Yao, Reza Yazdani Aminabadi, Olatunji Ruwase, Samyam Rajbhandari, Xiaoxia Wu, Ammar Ahmad Awan, Jeff Rasley, Minjia Zhang, Conglong Li, Connor Holmes

  • Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback, arXiv, 2307.15217, arxiv, pdf, cication: 36

    Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire · (jiqizhixin)

  • ICML '23 Tutorial on Reinforcement Learning from Human Feedback

    · (openlmlab.github) · (mp.weixin.qq)

  • Fine-Tuning Language Models with Advantage-Induced Policy Alignment, arXiv, 2306.02231, arxiv, pdf, cication: 5

    Banghua Zhu, Hiteshi Sharma, Felipe Vieira Frujeri, Shi Dong, Chenguang Zhu, Michael I. Jordan, Jiantao Jiao

  • System-Level Natural Language Feedback, arXiv, 2306.13588, arxiv, pdf, cication: 1

    Weizhe Yuan, Kyunghyun Cho, Jason Weston

  • Fine-Grained Human Feedback Gives Better Rewards for Language Model Training, arXiv, 2306.01693, arxiv, pdf, cication: 7

    Zeqiu Wu, Yushi Hu, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj Ammanabrolu, Noah A. Smith, Mari Ostendorf, Hannaneh Hajishirzi · (finegrainedrlhf.github) · (qbitai)

  • Direct Preference Optimization: Your Language Model is Secretly a Reward Model, arXiv, 2305.18290, arxiv, pdf, cication: -1

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn

  • Let's Verify Step by Step, arXiv, 2305.20050, arxiv, pdf, cication: 76

    Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, Karl Cobbe

  • Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback, arXiv, 2204.05862, arxiv, pdf, cication: 109

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan · (hh-rlhf - anthropics) Star

  • Training language models to follow instructions with human feedback, NeurIPS, 2022, arxiv, pdf, cication: 6793

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray

  • Learning to summarize from human feedback, NeurIPS, 2020, arxiv, pdf, cication: 1122

    Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul Christiano

Projects

Other

Extra reference

  • awesome-RLHF - opendilab Star

    A curated list of reinforcement learning with human feedback resources (continually updated)