-
A Comprehensive Survey of LLM Alignment Techniques: RLHF, RLAIF, PPO, DPO and More,
arXiv, 2407.16216
, arxiv, pdf, cication: -1Zhichao Wang, Bin Bi, Shiva Kumar Pentyala, Kiran Ramnath, Sougata Chaudhuri, Shubham Mehrotra, Zixu Zhu, Xiang-Bo Mao, Sitaram Asur
-
Towards Scalable Automated Alignment of LLMs: A Survey,
arXiv, 2406.01252
, arxiv, pdf, cication: -1Boxi Cao, Keming Lu, Xinyu Lu, Jiawei Chen, Mengjie Ren, Hao Xiang, Peilin Liu, Yaojie Lu, Ben He, Xianpei Han
-
Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study,
arXiv, 2404.10719
, arxiv, pdf, cication: -1Shusheng Xu, Wei Fu, Jiaxuan Gao, Wenjie Ye, Weilin Liu, Zhiyu Mei, Guangju Wang, Chao Yu, Yi Wu
-
On the Essence and Prospect: An Investigation of Alignment Approaches for Big Models,
arXiv, 2403.04204
, arxiv, pdf, cication: -1Xinpeng Wang, Shitong Duan, Xiaoyuan Yi, Jing Yao, Shanlin Zhou, Zhihua Wei, Peng Zhang, Dongkuan Xu, Maosong Sun, Xing Xie
-
AI Alignment: A Comprehensive Survey,
arXiv, 2310.19852
, arxiv, pdf, cication: 1Jiaming Ji, Tianyi Qiu, Boyuan Chen, Borong Zhang, Hantao Lou, Kaile Wang, Yawen Duan, Zhonghao He, Jiayi Zhou, Zhaowei Zhang
-
Instruction Tuning for Large Language Models: A Survey,
arXiv, 2308.10792
, arxiv, pdf, cication: 19Shengyu Zhang, Linfeng Dong, Xiaoya Li, Sen Zhang, Xiaofei Sun, Shuhe Wang, Jiwei Li, Runyi Hu, Tianwei Zhang, Fei Wu · (mp.weixin.qq)
-
Large Language Model Alignment: A Survey,
arXiv, 2309.15025
, arxiv, pdf, cication: 3Tianhao Shen, Renren Jin, Yufei Huang, Chuang Liu, Weilong Dong, Zishan Guo, Xinwei Wu, Yan Liu, Deyi Xiong · (jiqizhixin) · (llm-alignment-survey - Magnetic2014)
-
Aligning Large Language Models with Human: A Survey,
arXiv, 2307.12966
, arxiv, pdf, cication: 29Yufei Wang, Wanjun Zhong, Liangyou Li, Fei Mi, Xingshan Zeng, Wenyong Huang, Lifeng Shang, Xin Jiang, Qun Liu · (AlignLLMHumanSurvey - GaryYufei)
-
Course-Correction: Safety Alignment Using Synthetic Preferences,
arXiv, 2407.16637
, arxiv, pdf, cication: -1Rongwu Xu, Yishuo Cai, Zhenhong Zhou, Renjie Gu, Haiqin Weng, Yan Liu, Tianwei Zhang, Wei Xu, Han Qiu
-
Better Alignment with Instruction Back-and-Forth Translation,
arXiv, 2408.04614
, arxiv, pdf, cication: -1Thao Nguyen, Jeffrey Li, Sewoong Oh, Ludwig Schmidt, Jason Weston, Luke Zettlemoyer, Xian Li
-
Phi-3 Safety Post-Training: Aligning Language Models with a "Break-Fix" Cycle,
arXiv, 2407.13833
, arxiv, pdf, cication: -1Emman Haider, Daniel Perez-Becker, Thomas Portet, Piyush Madan, Amit Garg, David Majercak, Wen Wen, Dongwoo Kim, Ziyi Yang, Jianwen Zhang
-
Direct Preference Knowledge Distillation for Large Language Models,
arXiv, 2406.19774
, arxiv, pdf, cication: -1Yixing Li, Yuxian Gu, Li Dong, Dequan Wang, Yu Cheng, Furu Wei
-
On scalable oversight with weak LLMs judging strong LLMs,
arXiv, 2407.04622
, arxiv, pdf, cication: -1Zachary Kenton, Noah Y. Siegel, János Kramár, Jonah Brown-Cohen, Samuel Albanie, Jannis Bulian, Rishabh Agarwal, David Lindner, Yunhao Tang, Noah D. Goodman
-
Creativity Has Left the Chat: The Price of Debiasing Language Models,
arXiv, 2406.05587
, arxiv, pdf, cication: -1Behnam Mohammadi
-
Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms,
arXiv, 2406.02900
, arxiv, pdf, cication: -1Rafael Rafailov, Yaswanth Chittepu, Ryan Park, Harshit Sikchi, Joey Hejna, Bradley Knox, Chelsea Finn, Scott Niekum
-
Self-Improving Robust Preference Optimization,
arXiv, 2406.01660
, arxiv, pdf, cication: -1Eugene Choi, Arash Ahmadian, Matthieu Geist, Olivier Pietquin, Mohammad Gheshlaghi Azar
-
Show, Don't Tell: Aligning Language Models with Demonstrated Feedback,
arXiv, 2406.00888
, arxiv, pdf, cication: -1Omar Shaikh, Michelle Lam, Joey Hejna, Yijia Shao, Michael Bernstein, Diyi Yang
· (demonstrated-feedback - SALT-NLP)
-
Xwin-LM: Strong and Scalable Alignment Practice for LLMs,
arXiv, 2405.20335
, arxiv, pdf, cication: -1Bolin Ni, JingCheng Hu, Yixuan Wei, Houwen Peng, Zheng Zhang, Gaofeng Meng, Han Hu · (Xwin-LM - Xwin-LM)
-
Offline Regularised Reinforcement Learning for Large Language Models Alignment,
arXiv, 2405.19107
, arxiv, pdf, cication: -1Pierre Harvey Richemond, Yunhao Tang, Daniel Guo, Daniele Calandriello, Mohammad Gheshlaghi Azar, Rafael Rafailov, Bernardo Avila Pires, Eugene Tarassov, Lucas Spangher, Will Ellsworth
-
FLAME: Factuality-Aware Alignment for Large Language Models,
arXiv, 2405.01525
, arxiv, pdf, cication: -1Sheng-Chieh Lin, Luyu Gao, Barlas Oguz, Wenhan Xiong, Jimmy Lin, Wen-tau Yih, Xilun Chen
-
NeMo-Aligner: Scalable Toolkit for Efficient Model Alignment,
arXiv, 2405.01481
, arxiv, pdf, cication: -1Gerald Shen, Zhilin Wang, Olivier Delalleau, Jiaqi Zeng, Yi Dong, Daniel Egert, Shengyang Sun, Jimmy Zhang, Sahil Jain, Ali Taghibakhshi · (NeMo-Aligner - NVIDIA)
-
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions,
arXiv, 2404.13208
, arxiv, pdf, cication: -1Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, Alex Beutel
-
OpenBezoar: Small, Cost-Effective and Open Models Trained on Mixes of Instruction Data,
arXiv, 2404.12195
, arxiv, pdf, cication: -1Chandeepa Dissanayake, Lahiru Lowe, Sachith Gunasekara, Yasiru Ratnayake · (huggingface)
-
Learn Your Reference Model for Real Good Alignment,
arXiv, 2404.09656
, arxiv, pdf, cication: -1Alexey Gorbatovski, Boris Shaposhnikov, Alexey Malakhov, Nikita Surnachev, Yaroslav Aksenov, Ian Maksimov, Nikita Balagansky, Daniil Gavrilov
-
Foundational Challenges in Assuring Alignment and Safety of Large Language Models,
arXiv, 2404.09932
, arxiv, pdf, cication: -1Usman Anwar, Abulhair Saparov, Javier Rando, Daniel Paleka, Miles Turpin, Peter Hase, Ekdeep Singh Lubana, Erik Jenner, Stephen Casper, Oliver Sourbut
-
Verifiable by Design: Aligning Language Models to Quote from Pre-Training Data,
arXiv, 2404.03862
, arxiv, pdf, cication: -1Jingyu Zhang, Marc Marone, Tianjian Li, Benjamin Van Durme, Daniel Khashabi
-
CantTalkAboutThis: Aligning Language Models to Stay on Topic in Dialogues,
arXiv, 2404.03820
, arxiv, pdf, cication: -1Makesh Narsimhan Sreedhar, Traian Rebedea, Shaona Ghosh, Christopher Parisien
-
Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences,
arXiv, 2404.03715
, arxiv, pdf, cication: -1Corby Rosset, Ching-An Cheng, Arindam Mitra, Michael Santacroce, Ahmed Awadallah, Tengyang Xie
-
Alignment Studio: Aligning Large Language Models to Particular Contextual Regulations,
arXiv, 2403.09704
, arxiv, pdf, cication: -1Swapnaja Achintalwar, Ioana Baldini, Djallel Bouneffouf, Joan Byamugisha, Maria Chang, Pierre Dognin, Eitan Farchi, Ndivhuwo Makondo, Aleksandra Mojsilovic, Manish Nagireddy
-
Instruction-tuned Language Models are Better Knowledge Learners,
arXiv, 2402.12847
, arxiv, pdf, cication: -1Zhengbao Jiang, Zhiqing Sun, Weijia Shi, Pedro Rodriguez, Chunting Zhou, Graham Neubig, Xi Victoria Lin, Wen-tau Yih, Srinivasan Iyer
-
Reformatted Alignment,
arXiv, 2402.12219
, arxiv, pdf, cication: -1Run-Ze Fan, Xuefeng Li, Haoyang Zou, Junlong Li, Shwai He, Ethan Chern, Jiewen Hu, Pengfei Liu · (ReAlign - GAIR-NLP) · (gair-nlp.github)
· (qbitai)
-
Aligner: Achieving Efficient Alignment through Weak-to-Strong Correction,
arXiv, 2402.02416
, arxiv, pdf, cication: -1Jiaming Ji, Boyuan Chen, Hantao Lou, Donghai Hong, Borong Zhang, Xuehai Pan, Juntao Dai, Yaodong Yang · (jiqizhixin) · (aligner2024.github)
-
LESS: Selecting Influential Data for Targeted Instruction Tuning,
arXiv, 2402.04333
, arxiv, pdf, cication: -1Mengzhou Xia, Sadhika Malladi, Suchin Gururangan, Sanjeev Arora, Danqi Chen · (less - princeton-nlp)
· (qbitai)
-
Generative Representational Instruction Tuning,
arXiv, 2402.09906
, arxiv, pdf, cication: -1Niklas Muennighoff, Hongjin Su, Liang Wang, Nan Yang, Furu Wei, Tao Yu, Amanpreet Singh, Douwe Kiela
-
DeAL: Decoding-time Alignment for Large Language Models,
arXiv, 2402.06147
, arxiv, pdf, cication: -1James Y. Huang, Sailik Sengupta, Daniele Bonadiman, Yi-an Lai, Arshit Gupta, Nikolaos Pappas, Saab Mansour, Katrin Kirchhoff, Dan Roth
-
Direct Language Model Alignment from Online AI Feedback,
arXiv, 2402.04792
, arxiv, pdf, cication: -1Shangmin Guo, Biao Zhang, Tianlin Liu, Tianqi Liu, Misha Khalman, Felipe Llinares, Alexandre Rame, Thomas Mesnard, Yao Zhao, Bilal Piot
-
Specialized Language Models with Cheap Inference from Limited Domain Data,
arXiv, 2402.01093
, arxiv, pdf, cication: -1David Grangier, Angelos Katharopoulos, Pierre Ablin, Awni Hannun
-
Human-Instruction-Free LLM Self-Alignment with Limited Samples,
arXiv, 2401.06785
, arxiv, pdf, cication: -1Hongyi Guo, Yuanshun Yao, Wei Shen, Jiaheng Wei, Xiaoying Zhang, Zhaoran Wang, Yang Liu
-
WaveCoder: Widespread And Versatile Enhanced Instruction Tuning with Refined Data Generation,
arXiv, 2312.14187
, arxiv, pdf, cication: -1Zhaojian Yu, Xin Zhang, Ning Shang, Yangyu Huang, Can Xu, Yishujie Zhao, Wenxiang Hu, Qiufeng Yin
-
Teach Llamas to Talk: Recent Progress in Instruction Tuning
· (mp.weixin.qq)
-
weak-to-strong - openai
· (openai) · (cdn.openai) · (jiqizhixin) · (mp.weixin.qq)
-
Alignment for Honesty,
arXiv, 2312.07000
, arxiv, pdf, cication: -1Yuqing Yang, Ethan Chern, Xipeng Qiu, Graham Neubig, Pengfei Liu · (alignment-for-honesty - GAIR-NLP)
-
The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning,
arXiv, 2312.01552
, arxiv, pdf, cication: -1Bill Yuchen Lin, Abhilasha Ravichander, Ximing Lu, Nouha Dziri, Melanie Sclar, Khyathi Chandu, Chandra Bhagavatula, Yejin Choi · (allenai.github)
· (jiqizhixin)
· (URIAL - Re-Align)
-
Instruction-tuning Aligns LLMs to the Human Brain,
arXiv, 2312.00575
, arxiv, pdf, cication: -1Khai Loong Aw, Syrielle Montariol, Badr AlKhamissi, Martin Schrimpf, Antoine Bosselut
-
wizardlm - nlpxucan
Family of instruction-following LLMs powered by Evol-Instruct: WizardLM, WizardCoder
-
Trusted Source Alignment in Large Language Models,
arXiv, 2311.06697
, arxiv, pdf, cication: -1Vasilisa Bashlovkina, Zhaobin Kuang, Riley Matthews, Edward Clifford, Yennie Jun, William W. Cohen, Simon Baumgartner
-
AlignBench: Benchmarking Chinese Alignment of Large Language Models,
arXiv, 2311.18743
, arxiv, pdf, cication: 8Xiao Liu, Xuanyu Lei, Shengyuan Wang, Yue Huang, Zhuoer Feng, Bosi Wen, Jiale Cheng, Pei Ke, Yifan Xu, Weng Lam Tam · (AlignBench - THUDM)
-
Zephyr: Direct Distillation of LM Alignment,
arXiv, 2310.16944
, arxiv, pdf, cication: 1Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Clémentine Fourrier, Nathan Habib · (alignment-handbook - huggingface)
-
Controlled Decoding from Language Models,
arXiv, 2310.17022
, arxiv, pdf, cication: -1Sidharth Mudgal, Jong Lee, Harish Ganapathy, YaGuang Li, Tao Wang, Yanping Huang, Zhifeng Chen, Heng-Tze Cheng, Michael Collins, Trevor Strohman
-
Auto-Instruct: Automatic Instruction Generation and Ranking for Black-Box Language Models,
arXiv, 2310.13127
, arxiv, pdf, cication: -1Zhihan Zhang, Shuohang Wang, Wenhao Yu, Yichong Xu, Dan Iter, Qingkai Zeng, Yang Liu, Chenguang Zhu, Meng Jiang
-
An Emulator for Fine-Tuning Large Language Models using Small Language Models,
arXiv, 2310.12962
, arxiv, pdf, cication: -1Eric Mitchell, Rafael Rafailov, Archit Sharma, Chelsea Finn, Christopher D. Manning
-
NEFTune: Noisy Embeddings Improve Instruction Finetuning,
arXiv, 2310.05914
, arxiv, pdf, cication: -1Neel Jain, Ping-yeh Chiang, Yuxin Wen, John Kirchenbauer, Hong-Min Chu, Gowthami Somepalli, Brian R. Bartoldson, Bhavya Kailkhura, Avi Schwarzschild, Aniruddha Saha · (qbitai)
-
alignment-handbook - huggingface
Robust recipes to align language models with human and AI preferences
-
Xwin-LM - Xwin-LM
Xwin-LM: Powerful, Stable, and Reproducible LLM Alignment · (mp.weixin.qq)
-
Self-Alignment with Instruction Backtranslation,
arXiv, 2308.06259
, arxiv, pdf, cication: 13Xian Li, Ping Yu, Chunting Zhou, Timo Schick, Luke Zettlemoyer, Omer Levy, Jason Weston, Mike Lewis · (jiqizhixin)
-
Simple synthetic data reduces sycophancy in large language models,
arXiv, 2308.03958
, arxiv, pdf, cication: 7Jerry Wei, Da Huang, Yifeng Lu, Denny Zhou, Quoc V. Le
-
alignllmhumansurvey - garyyufei
Aligning Large Language Models with Human: A Survey
-
RLCD: Reinforcement Learning from Contrast Distillation for Language Model Alignment,
arXiv, 2307.12950
, arxiv, pdf, cication: 5Kevin Yang, Dan Klein, Asli Celikyilmaz, Nanyun Peng, Yuandong Tian
-
AlpaGasus: Training A Better Alpaca with Fewer Data,
arXiv, 2307.08701
, arxiv, pdf, cication: 11Lichang Chen, Shiyang Li, Jun Yan, Hai Wang, Kalpa Gunaratna, Vikas Yadav, Zheng Tang, Vijay Srinivasan, Tianyi Zhou, Heng Huang · (lichang-chen.github)
-
Instruction Mining: When Data Mining Meets Large Language Model Finetuning,
arXiv, 2307.06290
, arxiv, pdf, cication: 3Yihan Cao, Yanbin Kang, Chi Wang, Lichao Sun
-
Becoming self-instruct: introducing early stopping criteria for minimal instruct tuning,
arXiv, 2307.03692
, arxiv, pdf, cication: 2Waseem AlShikh, Manhal Daaboul, Kirk Goddard, Brock Imel, Kiran Kamble, Parikshith Kulkarni, Melisa Russak
-
Training Models to Generate, Recognize, and Reframe Unhelpful Thoughts,
arXiv, 2307.02768
, arxiv, pdf, cication: 2Mounica Maddela, Megan Ung, Jing Xu, Andrea Madotto, Heather Foran, Y-Lan Boureau
-
Goal Representations for Instruction Following: A Semi-Supervised Language Interface to Control,
arXiv, 2307.00117
, arxiv, pdf, cication: 3Vivek Myers, Andre He, Kuan Fang, Homer Walke, Philippe Hansen-Estruch, Ching-An Cheng, Mihai Jalobeanu, Andrey Kolobov, Anca Dragan, Sergey Levine
-
On the Exploitability of Instruction Tuning,
arXiv, 2306.17194
, arxiv, pdf, cication: 4Manli Shu, Jiongxiao Wang, Chen Zhu, Jonas Geiping, Chaowei Xiao, Tom Goldstein
-
Are aligned neural networks adversarially aligned?,
arXiv, 2306.15447
, arxiv, pdf, cication: 30Nicholas Carlini, Milad Nasr, Christopher A. Choquette-Choo, Matthew Jagielski, Irena Gao, Anas Awadalla, Pang Wei Koh, Daphne Ippolito, Katherine Lee, Florian Tramer
-
Constitutional AI: Harmlessness from AI Feedback,
arXiv, 2212.08073
, arxiv, pdf, cication: 249Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon
-
A General Language Assistant as a Laboratory for Alignment,
arXiv, 2112.00861
, arxiv, pdf, cication: 61Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma
-
Stanford CS25: V3 I Recipe for Training Helpful Chatbots - YouTube
-
The History of Open-Source LLMs: Imitation and Alignment (Part Three)
-
Teach Llamas to Talk: Recent Progress in Instruction Tuning
· (jiqizhixin)
-
A Survey of Reinforcement Learning from Human Feedback,
arXiv, 2312.14925
, arxiv, pdf, cication: 5Timo Kaufmann, Paul Weng, Viktor Bengs, Eyke Hüllermeier
-
Understanding Reference Policies in Direct Preference Optimization,
arXiv, 2407.13709
, arxiv, pdf, cication: -1Yixin Liu, Pengfei Liu, Arman Cohan · (refdpo - yale-nlp)
-
Conditioned Language Policy: A General Framework for Steerable Multi-Objective Finetuning,
arXiv, 2407.15762
, arxiv, pdf, cication: -1Kaiwen Wang, Rahul Kidambi, Ryan Sullivan, Alekh Agarwal, Christoph Dann, Andrea Michi, Marco Gelmi, Yunxuan Li, Raghav Gupta, Avinava Dubey
-
Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning,
arXiv, 2407.00782
, arxiv, pdf, cication: -1Zimu Lu, Aojun Zhou, Ke Wang, Houxing Ren, Weikang Shi, Junting Pan, Mingjie Zhan, Hongsheng Li
· (Step-Controlled_DPO - mathllm)
-
Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs,
arXiv, 2406.18629
, arxiv, pdf, cication: -1Xin Lai, Zhuotao Tian, Yukang Chen, Senqiao Yang, Xiangru Peng, Jiaya Jia
· (Step-DPO - dvlab-research)
-
WARP: On the Benefits of Weight Averaged Rewarded Policies,
arXiv, 2406.16768
, arxiv, pdf, cication: -1Alexandre Ramé, Johan Ferret, Nino Vieillard, Robert Dadashi, Léonard Hussenot, Pierre-Louis Cedoz, Pier Giuseppe Sessa, Sertan Girgin, Arthur Douillard, Olivier Bachem
-
Bootstrapping Language Models with DPO Implicit Rewards,
arXiv, 2406.09760
, arxiv, pdf, cication: -1Changyu Chen, Zichen Liu, Chao Du, Tianyu Pang, Qian Liu, Arunesh Sinha, Pradeep Varakantham, Min Lin · (dice - sail-sg)
-
WPO: Enhancing RLHF with Weighted Preference Optimization,
arXiv, 2406.11827
, arxiv, pdf, cication: -1Wenxuan Zhou, Ravi Agrawal, Shujian Zhang, Sathish Reddy Indurthi, Sanqiang Zhao, Kaiqiang Song, Silei Xu, Chenguang Zhu · (WPO - wzhouad)
-
mDPO: Conditional Preference Optimization for Multimodal Large Language Models,
arXiv, 2406.11839
, arxiv, pdf, cication: -1Fei Wang, Wenxuan Zhou, James Y. Huang, Nan Xu, Sheng Zhang, Hoifung Poon, Muhao Chen
-
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models,
arXiv, 2406.10162
, arxiv, pdf, cication: -1Carson Denison, Monte MacDiarmid, Fazl Barez, David Duvenaud, Shauna Kravec, Samuel Marks, Nicholas Schiefer, Ryan Soklaski, Alex Tamkin, Jared Kaplan
-
Mistral-C2F: Coarse to Fine Actor for Analytical and Reasoning Enhancement in RLHF and Effective-Merged LLMs,
arXiv, 2406.08657
, arxiv, pdf, cication: -1Chen Zheng, Ke Sun, Xun Zhou
-
HelpSteer2: Open-source dataset for training top-performing reward models,
arXiv, 2406.08673
, arxiv, pdf, cication: -1Zhilin Wang, Yi Dong, Olivier Delalleau, Jiaqi Zeng, Gerald Shen, Daniel Egert, Jimmy J. Zhang, Makesh Narsimhan Sreedhar, Oleksii Kuchaiev · (NeMo-Aligner - NVIDIA) · (huggingface)
-
Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback,
arXiv, 2406.09279
, arxiv, pdf, cication: -1Hamish Ivison, Yizhong Wang, Jiacheng Liu, Zeqiu Wu, Valentina Pyatkin, Nathan Lambert, Noah A. Smith, Yejin Choi, Hannaneh Hajishirzi · (EasyLM - hamishivi)
-
Discovering Preference Optimization Algorithms with and for Large Language Models,
arXiv, 2406.08414
, arxiv, pdf, cication: -1Chris Lu, Samuel Holt, Claudio Fanconi, Alex J. Chan, Jakob Foerster, Mihaela van der Schaar, Robert Tjarko Lange · (DiscoPOP - SakanaAI)
-
Self-Exploring Language Models: Active Preference Elicitation for Online Alignment,
arXiv, 2405.19332
, arxiv, pdf, cication: -1Shenao Zhang, Donghan Yu, Hiteshi Sharma, Ziyi Yang, Shuohang Wang, Hany Hassan, Zhaoran Wang · (SELM - shenao-zhang)
-
Value-Incentivized Preference Optimization: A Unified Approach to Online and Offline RLHF,
arXiv, 2405.19320
, arxiv, pdf, cication: -1Shicong Cen, Jincheng Mei, Katayoon Goshvadi, Hanjun Dai, Tong Yang, Sherry Yang, Dale Schuurmans, Yuejie Chi, Bo Dai
-
SimPO: Simple Preference Optimization with a Reference-Free Reward,
arXiv, 2405.14734
, arxiv, pdf, cication: -1Yu Meng, Mengzhou Xia, Danqi Chen
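For orientation, SimPO's reference-free objective is, roughly, a length-normalized Bradley-Terry loss with a target reward margin $\gamma$ (a sketch of the standard formulation; see the paper for exact notation):
$$\mathcal{L}_{\mathrm{SimPO}}(\pi_\theta) = -\,\mathbb{E}_{(x, y_w, y_l)\sim\mathcal{D}}\Big[\log \sigma\Big(\tfrac{\beta}{|y_w|}\log \pi_\theta(y_w \mid x) - \tfrac{\beta}{|y_l|}\log \pi_\theta(y_l \mid x) - \gamma\Big)\Big]$$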
-
OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework,
arXiv, 2405.11143
, arxiv, pdf, cication: -1Jian Hu, Xibin Wu, Weixun Wang, Xianyu, Dehao Zhang, Yu Cao · (OpenRLHF - OpenLLMAI)
-
RLHF Workflow: From Reward Modeling to Online RLHF,
arXiv, 2405.07863
, arxiv, pdf, cication: -1Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, Tong Zhang · (Online-RLHF - RLHFlow) · (RLHF-Reward-Modeling - RLHFlow)
· (huggingface)
-
Self-Play Preference Optimization for Language Model Alignment,
arXiv, 2405.00675
, arxiv, pdf, cication: -1Yue Wu, Zhiqing Sun, Huizhuo Yuan, Kaixuan Ji, Yiming Yang, Quanquan Gu
-
Iterative Reasoning Preference Optimization,
arXiv, 2404.19733
, arxiv, pdf, cication: -1Richard Yuanzhe Pang, Weizhe Yuan, Kyunghyun Cho, He He, Sainbayar Sukhbaatar, Jason Weston
-
Insights into Alignment: Evaluating DPO and its Variants Across Multiple Tasks,
arXiv, 2404.14723
, arxiv, pdf, cication: -1Amir Saeidi, Shivanshu Verma, Chitta Baral
-
From $r$ to $Q^*$: Your Language Model is Secretly a Q-Function,
arXiv, 2404.12358
, arxiv, pdf, cication: -1Rafael Rafailov, Joey Hejna, Ryan Park, Chelsea Finn
-
Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study,
arXiv, 2404.10719
, arxiv, pdf, cication: -1Shusheng Xu, Wei Fu, Jiaxuan Gao, Wenjie Ye, Weilin Liu, Zhiyu Mei, Guangju Wang, Chao Yu, Yi Wu
-
Reuse Your Rewards: Reward Model Transfer for Zero-Shot Cross-Lingual Alignment,
arXiv, 2404.12318
, arxiv, pdf, cication: -1Zhaofeng Wu, Ananth Balashankar, Yoon Kim, Jacob Eisenstein, Ahmad Beirami
-
Dataset Reset Policy Optimization for RLHF,
arXiv, 2404.08495
, arxiv, pdf, cication: -1Jonathan D. Chang, Wenhao Shan, Owen Oertell, Kianté Brantley, Dipendra Misra, Jason D. Lee, Wen Sun · (drpo - Cornell-RL)
-
RewardBench: Evaluating Reward Models for Language Modeling,
arXiv, 2403.13787
, arxiv, pdf, cication: -1Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi
A benchmark dataset and toolkit designed for the comprehensive evaluation of reward models used in RLHF.
-
reward-bench - allenai
RewardBench: the first evaluation tool for reward models. · (huggingface) · (twitter)
-
ChatGLM-RLHF: Practices of Aligning Large Language Models with Human Feedback,
arXiv, 2404.00934
, arxiv, pdf, cication: -1Zhenyu Hou, Yilin Niu, Zhengxiao Du, Xiaohan Zhang, Xiao Liu, Aohan Zeng, Qinkai Zheng, Minlie Huang, Hongning Wang, Jie Tang
-
sDPO: Don't Use Your Data All at Once,
arXiv, 2403.19270
, arxiv, pdf, cication: -1Dahyun Kim, Yungi Kim, Wonho Song, Hyeonwoo Kim, Yunsu Kim, Sanghoon Kim, Chanjun Park
-
The N+ Implementation Details of RLHF with PPO: A Case Study on TL;DR Summarization,
arXiv, 2403.17031
, arxiv, pdf, cication: -1Shengyi Huang, Michael Noukhovitch, Arian Hosseini, Kashif Rasul, Weixun Wang, Lewis Tunstall
· (summarize_from_feedback_details - vwxyzjn) · (huggingface) · (twitter)
-
PERL: Parameter Efficient Reinforcement Learning from Human Feedback,
arXiv, 2403.10704
, arxiv, pdf, cication: -1Hakim Sidahmed, Samrat Phatale, Alex Hutcheson, Zhuonan Lin, Zhang Chen, Zac Yu, Jarvis Jin, Roman Komarytsia, Christiane Ahlheim, Yonghao Zhu
PERL uses Low-Rank Adaptation (LoRA) to train models with Reinforcement Learning from Human Feedback (RLHF), aligning pretrained base LLMs with human preferences efficiently.
-
ORPO: Monolithic Preference Optimization without Reference Model,
arXiv, 2403.07691
, arxiv, pdf, cication: -1Jiwoo Hong, Noah Lee, James Thorne · (orpo - xfactlab)
-
Teaching Large Language Models to Reason with Reinforcement Learning,
arXiv, 2403.04642
, arxiv, pdf, cication: -1Alex Havrilla, Yuqing Du, Sharath Chandra Raparthy, Christoforos Nalmpantis, Jane Dwivedi-Yu, Maksym Zhuravinskyi, Eric Hambro, Sainbayar Sukhbaatar, Roberta Raileanu
-
Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs,
arXiv, 2402.14740
, arxiv, pdf, cication: -1Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Ahmet Üstün, Sara Hooker
-
Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive,
arXiv, 2402.13228
, arxiv, pdf, cication: -1Arka Pal, Deep Karkhanis, Samuel Dooley, Manley Roberts, Siddartha Naidu, Colin White
-
A Critical Evaluation of AI Feedback for Aligning Large Language Models,
arXiv, 2402.12366
, arxiv, pdf, cication: -1Archit Sharma, Sedrick Keh, Eric Mitchell, Chelsea Finn, Kushal Arora, Thomas Kollar
-
RLVF: Learning from Verbal Feedback without Overgeneralization,
arXiv, 2402.10893
, arxiv, pdf, cication: -1Moritz Stephan, Alexander Khazatsky, Eric Mitchell, Annie S Chen, Sheryl Hsu, Archit Sharma, Chelsea Finn
-
A Minimaximalist Approach to Reinforcement Learning from Human Feedback,
arXiv, 2401.04056
, arxiv, pdf, cication: 4Gokul Swamy, Christoph Dann, Rahul Kidambi, Zhiwei Steven Wu, Alekh Agarwal · (jiqizhixin)
-
Suppressing Pink Elephants with Direct Principle Feedback,
arXiv, 2402.07896
, arxiv, pdf, cication: -1Louis Castricato, Nathan Lile, Suraj Anand, Hailey Schoelkopf, Siddharth Verma, Stella Biderman
-
ODIN: Disentangled Reward Mitigates Hacking in RLHF,
arXiv, 2402.07319
, arxiv, pdf, cication: -1Lichang Chen, Chen Zhu, Davit Soselia, Jiuhai Chen, Tianyi Zhou, Tom Goldstein, Heng Huang, Mohammad Shoeybi, Bryan Catanzaro
-
LiPO: Listwise Preference Optimization through Learning-to-Rank,
arXiv, 2402.01878
, arxiv, pdf, cication: -1Tianqi Liu, Zhen Qin, Junru Wu, Jiaming Shen, Misha Khalman, Rishabh Joshi, Yao Zhao, Mohammad Saleh, Simon Baumgartner, Jialu Liu
-
StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback,
arXiv, 2402.01391
, arxiv, pdf, cication: -1Shihan Dou, Yan Liu, Haoxiang Jia, Limao Xiong, Enyu Zhou, Wei Shen, Junjie Shan, Caishuang Huang, Xiao Wang, Xiaoran Fan
-
Transforming and Combining Rewards for Aligning Large Language Models,
arXiv, 2402.00742
, arxiv, pdf, cication: -1Zihao Wang, Chirag Nagpal, Jonathan Berant, Jacob Eisenstein, Alex D'Amour, Sanmi Koyejo, Victor Veitch
-
Aligning Large Language Models with Counterfactual DPO,
arXiv, 2401.09566
, arxiv, pdf, cication: -1Bradley Butcher
-
WARM: On the Benefits of Weight Averaged Reward Models,
arXiv, 2401.12187
, arxiv, pdf, cication: -1Alexandre Ramé, Nino Vieillard, Léonard Hussenot, Robert Dadashi, Geoffrey Cideron, Olivier Bachem, Johan Ferret
-
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity,
arXiv, 2401.01967
, arxiv, pdf, cication: 11Andrew Lee, Xiaoyan Bai, Itamar Pres, Martin Wattenberg, Jonathan K. Kummerfeld, Rada Mihalcea
-
ReFT: Reasoning with Reinforced Fine-Tuning,
arXiv, 2401.08967
, arxiv, pdf, cication: -1Trung Quoc Luong, Xinbo Zhang, Zhanming Jie, Peng Sun, Xiaoran Jin, Hang Li
-
Self-Rewarding Language Models,
arXiv, 2401.10020
, arxiv, pdf, cication: -1Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Sainbayar Sukhbaatar, Jing Xu, Jason Weston
-
Contrastive Preference Optimization: Pushing the Boundaries of LLM Performance in Machine Translation,
arXiv, 2401.08417
, arxiv, pdf, cication: -1Haoran Xu, Amr Sharaf, Yunmo Chen, Weiting Tan, Lingfeng Shen, Benjamin Van Durme, Kenton Murray, Young Jin Kim
-
Secrets of RLHF in Large Language Models Part II: Reward Modeling,
arXiv, 2401.06080
, arxiv, pdf, cication: -1Binghai Wang, Rui Zheng, Lu Chen, Yan Liu, Shihan Dou, Caishuang Huang, Wei Shen, Senjie Jin, Enyu Zhou, Chenyu Shi
· (jiqizhixin)
-
ICE-GRT: Instruction Context Enhancement by Generative Reinforcement based Transformers,
arXiv, 2401.02072
, arxiv, pdf, cication: -1Chen Zheng, Ke Sun, Da Tang, Yukun Ma, Yuyu Zhang, Chenguang Xi, Xun Zhou
-
InstructVideo: Instructing Video Diffusion Models with Human Feedback,
arXiv, 2312.12490
, arxiv, pdf, cication: -1Hangjie Yuan, Shiwei Zhang, Xiang Wang, Yujie Wei, Tao Feng, Yining Pan, Yingya Zhang, Ziwei Liu, Samuel Albanie, Dong Ni
-
Silkie: Preference Distillation for Large Visual Language Models,
arXiv, 2312.10665
, arxiv, pdf, cication: -1Lei Li, Zhihui Xie, Mukai Li, Shunian Chen, Peiyi Wang, Liang Chen, Yazheng Yang, Benyou Wang, Lingpeng Kong
-
Align on the Fly: Adapting Chatbot Behavior to Established Norms,
arXiv, 2312.15907
, arxiv, pdf, cication: -1Chunpu Xu, Steffi Chern, Ethan Chern, Ge Zhang, Zekun Wang, Ruibo Liu, Jing Li, Jie Fu, Pengfei Liu · (jiqizhixin) · (OPO - GAIR-NLP) · (gair-nlp.github)
-
Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking,
arXiv, 2312.09244
, arxiv, pdf, cication: -1Jacob Eisenstein, Chirag Nagpal, Alekh Agarwal, Ahmad Beirami, Alex D'Amour, DJ Dvijotham, Adam Fisch, Katherine Heller, Stephen Pfohl, Deepak Ramachandran
-
Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models,
arXiv, 2312.06585
, arxiv, pdf, cication: -1Avi Singh, John D. Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Peter J. Liu, James Harrison, Jaehoon Lee, Kelvin Xu, Aaron Parisi
-
HALOs - ContextualAI
Human-Centered Loss Functions (HALOs) · (HALOs - ContextualAI)
-
Axiomatic Preference Modeling for Longform Question Answering,
arXiv, 2312.02206
, arxiv, pdf, cication: -1Corby Rosset, Guoqing Zheng, Victor Dibia, Ahmed Awadallah, Paul Bennett · (huggingface)
-
Nash Learning from Human Feedback,
arXiv, 2312.00886
, arxiv, pdf, cication: -1Rémi Munos, Michal Valko, Daniele Calandriello, Mohammad Gheshlaghi Azar, Mark Rowland, Zhaohan Daniel Guo, Yunhao Tang, Matthieu Geist, Thomas Mesnard, Andrea Michi
-
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback,
arXiv, 2312.00849
, arxiv, pdf, cication: -1Tianyu Yu, Yuan Yao, Haoye Zhang, Taiwen He, Yifeng Han, Ganqu Cui, Jinyi Hu, Zhiyuan Liu, Hai-Tao Zheng, Maosong Sun · (RLHF-V - RLHF-V)
-
Starling-7B: Increasing LLM Helpfulness & Harmlessness with RLAIF
-
Adversarial Preference Optimization,
arXiv, 2311.08045
, arxiv, pdf, cication: -1Pengyu Cheng, Yifan Yang, Jian Li, Yong Dai, Nan Du
· (mp.weixin.qq)
-
Diffusion Model Alignment Using Direct Preference Optimization,
arXiv, 2311.12908
, arxiv, pdf, cication: -1Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, Nikhil Naik
-
Black-Box Prompt Optimization: Aligning Large Language Models without Model Training,
arXiv, 2311.04155
, arxiv, pdf, cication: -1Jiale Cheng, Xiao Liu, Kehan Zheng, Pei Ke, Hongning Wang, Yuxiao Dong, Jie Tang, Minlie Huang · (bpo - thu-coai)
-
Towards Understanding Sycophancy in Language Models,
arXiv, 2310.13548
, arxiv, pdf, cication: -1Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston · (jiqizhixin)
-
Contrastive Preference Learning: Learning from Human Feedback without RL,
arXiv, 2310.13639
, arxiv, pdf, cication: -1Joey Hejna, Rafael Rafailov, Harshit Sikchi, Chelsea Finn, Scott Niekum, W. Bradley Knox, Dorsa Sadigh · (jiqizhixin)
-
Don't throw away your value model! Making PPO even better via Value-Guided Monte-Carlo Tree Search decoding,
arXiv, 2309.15028
, arxiv, pdf, cication: 1Jiacheng Liu, Andrew Cohen, Ramakanth Pasunuru, Yejin Choi, Hannaneh Hajishirzi, Asli Celikyilmaz · (jiqizhixin)
-
Specific versus General Principles for Constitutional AI,
arXiv, 2310.13798
, arxiv, pdf, cication: 1Sandipan Kundu, Yuntao Bai, Saurav Kadavath, Amanda Askell, Andrew Callahan, Anna Chen, Anna Goldie, Avital Balwit, Azalia Mirhoseini, Brayden McLean
-
A General Theoretical Paradigm to Understand Learning from Human Preferences,
arXiv, 2310.12036
, arxiv, pdf, cication: 1Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, Rémi Munos
-
Tuna: Instruction Tuning using Feedback from Large Language Models,
arXiv, 2310.13385
, arxiv, pdf, cication: -1Haoran Li, Yiran Liu, Xingxing Zhang, Wei Lu, Furu Wei
-
Safe RLHF: Safe Reinforcement Learning from Human Feedback,
arXiv, 2310.12773
, arxiv, pdf, cication: 1Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, Yaodong Yang
-
ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models,
arXiv, 2310.10505
, arxiv, pdf, cication: -1Ziniu Li, Tian Xu, Yushun Zhang, Yang Yu, Ruoyu Sun, Zhi-Quan Luo · (jiqizhixin)
-
Rethinking the Role of PPO in RLHF – The Berkeley Artificial Intelligence Research Blog
-
Reinforcement Learning in the Era of LLMs: What is Essential? What is needed? An RL Perspective on RLHF, Prompting, and Beyond,
arXiv, 2310.06147
, arxiv, pdf, cication: -1Hao Sun
-
A Long Way to Go: Investigating Length Correlations in RLHF,
arXiv, 2310.03716
, arxiv, pdf, cication: 3Prasann Singhal, Tanya Goyal, Jiacheng Xu, Greg Durrett
-
Aligning Large Multimodal Models with Factually Augmented RLHF,
arXiv, 2309.14525
, arxiv, pdf, cication: 4Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang
-
Stabilizing RLHF through Advantage Model and Selective Rehearsal,
arXiv, 2309.10202
, arxiv, pdf, cication: 1Baolin Peng, Linfeng Song, Ye Tian, Lifeng Jin, Haitao Mi, Dong Yu
-
Statistical Rejection Sampling Improves Preference Optimization,
arXiv, 2309.06657
, arxiv, pdf, cication: -1Tianqi Liu, Yao Zhao, Rishabh Joshi, Misha Khalman, Mohammad Saleh, Peter J. Liu, Jialu Liu
-
Efficient RLHF: Reducing the Memory Usage of PPO,
arXiv, 2309.00754
, arxiv, pdf, cication: 1Michael Santacroce, Yadong Lu, Han Yu, Yuanzhi Li, Yelong Shen
-
RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback,
arXiv, 2309.00267
, arxiv, pdf, cication: 24Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor Carbune, Abhinav Rastogi · (mp.weixin.qq)
-
Reinforced Self-Training (ReST) for Language Modeling,
arXiv, 2308.08998
, arxiv, pdf, cication: 12Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu · (jiqizhixin)
-
DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales,
arXiv, 2308.01320
, arxiv, pdf, cication: 4Zhewei Yao, Reza Yazdani Aminabadi, Olatunji Ruwase, Samyam Rajbhandari, Xiaoxia Wu, Ammar Ahmad Awan, Jeff Rasley, Minjia Zhang, Conglong Li, Connor Holmes
-
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback,
arXiv, 2307.15217
, arxiv, pdf, cication: 36Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire · (jiqizhixin)
-
ICML '23 Tutorial on Reinforcement Learning from Human Feedback
· (openlmlab.github) · (mp.weixin.qq)
-
Fine-Tuning Language Models with Advantage-Induced Policy Alignment,
arXiv, 2306.02231
, arxiv, pdf, cication: 5Banghua Zhu, Hiteshi Sharma, Felipe Vieira Frujeri, Shi Dong, Chenguang Zhu, Michael I. Jordan, Jiantao Jiao
-
System-Level Natural Language Feedback,
arXiv, 2306.13588
, arxiv, pdf, cication: 1Weizhe Yuan, Kyunghyun Cho, Jason Weston
-
Fine-Grained Human Feedback Gives Better Rewards for Language Model Training,
arXiv, 2306.01693
, arxiv, pdf, cication: 7Zeqiu Wu, Yushi Hu, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj Ammanabrolu, Noah A. Smith, Mari Ostendorf, Hannaneh Hajishirzi · (finegrainedrlhf.github) · (qbitai)
-
Direct Preference Optimization: Your Language Model is Secretly a Reward Model,
arXiv, 2305.18290
, arxiv, pdf, cication: -1Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn
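As a quick reference for the many DPO variants listed above, the core DPO objective replaces the RL step with a single classification-style loss on preference pairs $(x, y_w, y_l)$ against a frozen reference policy $\pi_{\mathrm{ref}}$ (standard formulation from the paper):
$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l)\sim\mathcal{D}}\Big[\log \sigma\Big(\beta \log \tfrac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \tfrac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\Big)\Big]$$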
-
Let's Verify Step by Step,
arXiv, 2305.20050
, arxiv, pdf, cication: 76Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, Karl Cobbe
-
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback,
arXiv, 2204.05862
, arxiv, pdf, cication: 109Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan · (hh-rlhf - anthropics)
-
Training language models to follow instructions with human feedback,
NeurIPS, 2022
, arxiv, pdf, cication: 6793Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray
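For context on the PPO-based entries in this section, RLHF in this line of work optimizes a learned reward model under a KL penalty that keeps the policy close to the SFT model (standard formulation; the paper additionally mixes in a pretraining-likelihood term):
$$\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[r_\phi(x, y)\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi_\theta(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big]$$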
-
Learning to summarize from human feedback,
NeurIPS, 2020
, arxiv, pdf, cication: 1122Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul Christiano
-
PairRM - llm-blender 🤗
-
OpenRLHF - OpenLLMAI
A Ray-based High-performance RLHF framework (for 7B on RTX4090 and 34B on A100)
-
direct-preference-optimization - eric-mitchell
Reference implementation for DPO (Direct Preference Optimization)
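A minimal, self-contained sketch of the pairwise loss such reference implementations compute (illustrative only, not the repo's actual code; log-probabilities are assumed to be summed over response tokens):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Pairwise DPO loss over per-response (token-summed) log-probabilities."""
    # Log-ratio of the policy to the frozen reference, per response.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # -log sigmoid(beta * (chosen margin - rejected margin)), averaged over the batch.
    return -F.logsigmoid(beta * (chosen_logratios - rejected_logratios)).mean()
```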
-
trl - huggingface
Train transformer language models with reinforcement learning.
-
tril - cornell-rl
-
Preference Tuning LLMs with Direct Preference Optimization Methods
· (jiqizhixin)
-
Reinforcement Learning from Human Feedback - DeepLearning.AI
-
LLM Training: RLHF and Its Alternatives
· (mp.weixin.qq)
-
ICML '23 Tutorial on Reinforcement Learning from Human Feedback
· (mp.weixin.qq)
-
awesome-RLHF - opendilab
A curated list of reinforcement learning with human feedback resources (continually updated)