👋 This is a collection of papers, surveys, and related resources for research on language model alignment and beyond, covering learning from human feedback and interactive NLP.
-
Jin Chen, Zheng Liu, Xu Huang, Chenwang Wu, Qi Liu, Gangwei Jiang, Yuanhao Pu, Yuxuan Lei, Xiaolong Chen, Xingmei Wang, Defu Lian, Enhong Chen. When Large Language Models Meet Personalization: Perspectives of Challenges and Opportunities. arXiv preprint 2023
-
Liangming Pan, Michael Saxon, Wenda Xu, Deepak Nathani, Xinyi Wang, William Yang Wang. Automatically Correcting Large Language Models: Surveying the landscape of diverse self-correction strategies. arXiv preprint 2023
-
Yang Liu, Yuanshun Yao, Jean-Francois Ton, Xiaoying Zhang, Ruocheng Guo, Hao Cheng, Yegor Klochkov, Muhammad Faaiz Taufiq, Hang Li. Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment. arXiv preprint 2023
-
Yufei Wang, Wanjun Zhong, Liangyou Li, Fei Mi, Xingshan Zeng, Wenyong Huang, Lifeng Shang, Xin Jiang, Qun Liu. Aligning Large Language Models with Human: A Survey. arXiv preprint 2023
-
Zekun Wang, Ge Zhang, Kexin Yang, Ning Shi, Wangchunshu Zhou, Shaochun Hao, Guangzheng Xiong, Yizhi Li, Mong Yuan Sim, Xiuying Chen, Qingqing Zhu, Zhenzhu Yang, Adam Nik, Qi Liu, Chenghua Lin, Shi Wang, Ruibo Liu, Wenhu Chen, Ke Xu, Dayiheng Liu, Yike Guo, Jie Fu. Interactive Natural Language Processing. arXiv preprint 2023
-
Zijie J. Wang, Dongjin Choi, Shenyu Xu, Diyi Yang. Putting Humans in the Natural Language Processing Loop: A Survey. CoRR, abs/2103.04044, 2021
-
Burr Settles. Active Learning Literature Survey. Computer Sciences Technical Report 1648, University of Wisconsin-Madison, 2009
- PandaLM: Reproducible and Automated Language Model Assessment
- Constrained Value-Aligned LLM via Safe RLHF
- Chatbot Arena: Benchmarking LLMs in the Wild with Elo Ratings
- AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback
- An Automatic Evaluator for Instruction-following Language Models
- Large-scale, Informative, and Diverse Multi-round Dialogue Data, and Models
- The Open Orca Dataset
- Research on Value Evaluation and Alignment of Chinese Large Language Models (面向中文大模型价值观的评估与对齐研究)
- LEval
- DeepSpeed-Chat
-
H Dong, W Xiong, D Goyal, R Pan, S Diao, J Zhang, K Shum, T Zhang. RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment. arXiv preprint 2023
-
Jian Hu, Li Tao, June Yang, Chandler Zhou. Aligning Language Models with Offline Reinforcement Learning from Human Feedback. arXiv preprint 2023
-
AN Lee, CJ Hunter, N Ruiz. Platypus: Quick, Cheap, and Powerful Refinement of LLMs. arXiv preprint 2023
-
Seonghyeon Ye, Doyoung Kim, Sungdong Kim, Hyeonbin Hwang, Seungone Kim, Yongrae Jo, James Thorne, Juho Kim, Minjoon Seo. FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets. arXiv preprint 2023
-
Xiaoxuan Wang, Ziniu Hu, Pan Lu, Yanqiao Zhu, Jieyu Zhang, Satyen Subramaniam, Arjun R. Loomba, Shichang Zhang, Yizhou Sun, Wei Wang. SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models. arXiv preprint 2023
-
Kevin Yang, Dan Klein, Asli Celikyilmaz, Nanyun Peng, Yuandong Tian. RLCD: Reinforcement Learning from Contrast Distillation for Language Model Alignment. arXiv preprint 2023
-
Tomohiro Sawada, Daniel Paleka, Alexander Havrilla, Pranav Tadepalli, Paula Vidas, Alexander Kranias, John J. Nay, Kshitij Gupta, Aran Komatsuzaki. ARB: Advanced Reasoning Benchmark for Large Language Models. arXiv preprint 2023
-
Neel Jain, Khalid Saifullah, Yuxin Wen, John Kirchenbauer, Manli Shu, Aniruddha Saha, Micah Goldblum, Jonas Geiping, Tom Goldstein. Bring Your Own Data! Self-Supervised Evaluation for Large Language Models. arXiv preprint 2023
-
Chenxin An, Shansan Gong, Ming Zhong, Mukai Li, Jun Zhang, Lingpeng Kong, Xipeng Qiu. L-Eval: Instituting Standardized Evaluation for Long Context Language Models. arXiv preprint 2023
-
Shihao Liang, Kunlun Zhu, Runchu Tian, Yujia Qin, Huadong Wang, Xin Cong, Zhiyuan Liu, Xiaojiang Liu, Maosong Sun. Exploring Format Consistency for Instruction Tuning. arXiv preprint 2023
-
Ruosen Li, Teerth Patel, Xinya Du. PRD: Peer Rank and Discussion Improve Large Language Model based Evaluations. arXiv preprint 2023
-
Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Chuanqi Tan, Chang Zhou. Scaling Relationship on Learning Mathematical Reasoning with Large Language Models. arXiv preprint 2023
-
Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Wang, Samuel Marks, Charbel-Raphaël Segerie, Micah Carroll, Andi Peng, Phillip Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, Lauro Langosco, Peter Hase, Erdem Bıyık, Anca Dragan, David Krueger, Dorsa Sadigh, Dylan Hadfield-Menell. Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback. arXiv preprint 2023
-
Siddhartha Jain, Xiaofei Ma, Anoop Deoras, Bing Xiang. Self-consistency for open-ended generations. arXiv preprint 2023
-
Feifan Song, Bowen Yu, Minghao Li, Haiyang Yu, Fei Huang, Yongbin Li, Houfeng Wang. Preference Ranking Optimization for Human Alignment. arXiv preprint 2023
-
Markus Anderljung, Joslyn Barnhart, Anton Korinek, Jade Leung, Cullen O'Keefe, Jess Whittlestone, Shahar Avin, Miles Brundage, Justin Bullock, Duncan Cass-Beggs, Ben Chang, Tantum Collins, Tim Fist, Gillian Hadfield, Alan Hayes, Lewis Ho, Sara Hooker, Eric Horvitz, Noam Kolt, Jonas Schuett, Yonadav Shavit, Divya Siddarth, Robert Trager, Kevin Wolf. Frontier AI Regulation: Managing Emerging Risks to Public Safety. arXiv preprint 2023
-
Lichang Chen, Shiyang Li, Jun Yan, Hai Wang, Kalpa Gunaratna, Vikas Yadav, Zheng Tang, Vijay Srinivasan, Tianyi Zhou, Heng Huang, Hongxia Jin. AlpaGasus: Training A Better Alpaca with Fewer Data. arXiv preprint 2023
-
Wenxuan Zhang, Sharifah Mahani Aljunied, Chang Gao, Yew Ken Chia, Lidong Bing. M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models. arXiv preprint 2023
-
Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V. Le, Barret Zoph, Jason Wei, Adam Roberts. The Flan Collection: Designing Data and Methods for Effective Instruction Tuning. arXiv preprint 2023
-
Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawahar, Sahaj Agarwal, Hamid Palangi, Ahmed Awadallah. Orca: Progressive Learning from Complex Explanation Traces of GPT-4. arXiv preprint 2023
-
Arnav Gudibande, Eric Wallace, Charlie Snell, Xinyang Geng, Hao Liu, Pieter Abbeel, Sergey Levine, Dawn Song. The False Promise of Imitating Proprietary LLMs. arXiv preprint 2023
-
Zeqiu Wu, Yushi Hu, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj Ammanabrolu, Noah A Smith, Mari Ostendorf, Hannaneh Hajishirzi. Fine-Grained Human Feedback Gives Better Rewards for Language Model Training. arXiv preprint 2023
-
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv preprint 2023
-
Peiyi Wang, Lei Li, Liang Chen, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, Zhifang Sui. Large Language Models are not Fair Evaluators. arXiv preprint 2023
-
Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, Daxin Jiang. WizardLM: Empowering Large Language Models to Follow Complex Instructions. arXiv preprint 2023
-
Yizhong Wang, Hamish Ivison, Pradeep Dasigi, Jack Hessel, Tushar Khot, Khyathi Raghavi Chandu, David Wadden, Kelsey MacMillan, Noah A. Smith, Iz Beltagy, Hannaneh Hajishirzi. How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources. arXiv preprint 2023
-
Yidong Wang, Zhuohao Yu, Zhengran Zeng, Linyi Yang, Cunxiang Wang, Hao Chen, Chaoya Jiang, Rui Xie, Jindong Wang, Xing Xie, Wei Ye, Shikun Zhang, Yue Zhang. PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization. arXiv preprint 2023
-
Yew Ken Chia, Pengfei Hong, Lidong Bing, Soujanya Poria. INSTRUCTEVAL: Towards Holistic Evaluation of Instruction-Tuned Large Language Models. arXiv preprint 2023
-
Yuxin Jiang, Chunkit Chan, Mingyang Chen, Wei Wang. Lion: Adversarial Distillation of Closed-Source Large Language Model. arXiv preprint 2023
-
Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. arXiv preprint 2023
Compared to PPO, DPO optimizes the policy directly on preference data, without learning a separate reward model. The drawback is that DPO cannot make use of data that lacks human preference labels: DPO is essentially a supervised method, whereas PPO-based RLHF can be viewed as semi-supervised, since a learned reward model can score arbitrary new generations.
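To make the contrast concrete, below is a minimal sketch of the DPO objective on a batch of preference pairs; the function name `dpo_loss` and the precomputed log-probability tensors are illustrative assumptions, not code from the paper.

```python
# Minimal sketch of the DPO loss, assuming the summed token log-probabilities
# of each chosen/rejected response under the policy and the frozen reference
# model have already been computed.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit "reward" of each response: beta * log( pi(y|x) / pi_ref(y|x) )
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry preference likelihood: push the chosen response's implicit
    # reward above the rejected one's, with no explicit reward model involved.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```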
-
Ruibo Liu, Ruixin Yang, Chenyan Jia, Ge Zhang, Denny Zhou, Andrew M. Dai, Diyi Yang, Soroush Vosoughi. Training Socially Aligned Language Models in Simulated Human Society. arXiv preprint 2023
-
Da Yin, Xiao Liu, Fan Yin, Ming Zhong, Hritik Bansal, Jiawei Han, Kai-Wei Chang. Dynosaur: A Dynamic Growth Paradigm for Instruction-Tuning Data Curation. arXiv preprint 2023
-
Sungdong Kim, Sanghwan Bae, Jamin Shin, Soyoung Kang, Donghyun Kwak, Kang Min Yoo, Minjoon Seo. Aligning Large Language Models through Synthetic Feedback. arXiv preprint 2023
-
Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, Susan Zhang, Gargi Ghosh, Mike Lewis, Luke Zettlemoyer, Omer Levy. LIMA: Less Is More for Alignment. arXiv preprint 2023
-
Yuan Z, Yuan H, Tan C, Wang W, Huang S, Huang F. RRHF: Rank Responses to Align Language Models with Human Feedback without tears. arXiv preprint 2023
-
Sun Z, Shen Y, Zhou Q, Zhang H, Chen Z, Cox D, Yang Y, Gan C. Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision. arXiv preprint 2023
-
Wang Y, Kordi Y, Mishra S, Liu A, Smith NA, Khashabi D, Hajishirzi H. Self-Instruct: Aligning Language Models with Self-Generated Instructions. ACL 2023
-
Zhao Y, Joshi R, Liu T, Khalman M, Saleh M, Liu PJ. SLiC-HF: Sequence Likelihood Calibration with Human Feedback. arXiv preprint arXiv:2305.10425. 2023
-
Yan H, Srivastava S, Tai Y, Wang SI, Yih WT, Yao Z. Learning to Simulate Natural Language Feedback for Interactive Semantic Parsing. ACL 2023
-
Akyürek AF, Akyürek E, Madaan A, Kalyan A, Clark P, Wijaya D, Tandon N. RL4F: Generating Natural Language Feedback with Reinforcement Learning for Repairing Model Outputs. ACL 2023
-
Jérémy Scheurer, Jon Ander Campos, Tomasz Korbak, Jun Shern Chan, Angelica Chen, Kyunghyun Cho, Ethan Perez. Training Language Models with Language Feedback at Scale. arXiv preprint 2023
-
Sean Welleck, Ximing Lu, Peter West, Faeze Brahman, Tianxiao Shen, Daniel Khashabi, Yejin Choi. Generating Sequences by Learning to Self-Correct. ICLR 2023
-
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional AI: Harmlessness from AI Feedback. arXiv preprint arXiv:2212.08073, 2022
-
Kurt Shuster, Jing Xu, Mojtaba Komeili, Da Ju, Eric Michael Smith, Stephen Roller, Megan Ung, Moya Chen, Kushal Arora, Joshua Lane, Morteza Behrooz, William Ngan, Spencer Poff, Naman Goyal, Arthur Szlam, Y-Lan Boureau, Melanie Kambadur, Jason Weston. BlenderBot 3: a deployed conversational agent that continually learns to responsibly engage. CoRR, abs/2208.03188, 2022
-
Rongzhi Zhang, Yue Yu, Pranav Shetty, Le Song, Chao Zhang. PRBoost: Prompt-Based Rule Discovery and Boosting for Interactive Weakly-Supervised Learning. ACL 2022
-
Mina Lee, Megha Srivastava, Amelia Hardy, John Thickstun, Esin Durmus, Ashwin Paranjape, Ines Gerard-Ursin, Xiang Lisa Li, Faisal Ladhak, Frieda Rong, Rose E. Wang, Minae Kwon, Joon Sung Park, Hancheng Cao, Tony Lee, Rishi Bommasani, Michael S. Bernstein, Percy Liang. Evaluating Human-Language Model Interaction. CoRR, abs/2212.09746, 2022
-
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, Ryan Lowe. Training language models to follow instructions with human feedback. CoRR, abs/2203.02155, 2022
-
Ge Gao, Eunsol Choi, Yoav Artzi. Simulating Bandit Learning from User Feedback for Extractive Question Answering. ACL 2022
-
William Saunders, Catherine Yeh, Jeff Wu, Steven Bills, Long Ouyang, Jonathan Ward, Jan Leike. Self-critiquing models for assisting human evaluators. CoRR, abs/2206.05802, 2022
-
Ranjay Krishna, Donsuk Lee, Li Fei-Fei, Michael S. Bernstein. Socially situated artificial intelligence enables learning from human interaction. Proceedings of the National Academy of Sciences, 119, 2022
-
Jeff Wu, Long Ouyang, Daniel M. Ziegler, Nisan Stiennon, Ryan Lowe, Jan Leike, Paul F. Christiano. Recursively Summarizing Books with Human Feedback. CoRR, abs/2109.10862, 2021
-
Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, John Schulman. WebGPT: Browser-assisted question-answering with human feedback. CoRR, abs/2112.09332, 2021
-
Vania Mendonca, Ricardo Rei, Luisa Coheur, Alberto Sardinha, Ana Lucia Santos. Online Learning Meets Machine Translation Evaluation: Finding the Best Systems with the Least Human Effort. ACL/IJCNLP 2021
-
Noriyuki Kojima, Alane Suhr, Yoav Artzi. Continual Learning for Grounded Instruction Generation by Observing Human Following Behavior. TACL, 9, 2021
-
Ahmed Elgohary, Christopher Meek, Matthew Richardson, Adam Fourney, Gonzalo A. Ramos, Ahmed Hassan Awadallah. NL-EDIT: Correcting Semantic Parse Errors through Natural Language Interaction. NAACL-HLT 2021
-
Tobias Falke, Patrick Lehnen. Feedback Attribution for Counterfactual Bandit Learning in Multi-Domain Spoken Language Understanding. EMNLP 2021
-
Ahmed Elgohary, Saghar Hosseini, Ahmed Hassan Awadallah. Speak to your Parser: Interactive Text-to-SQL with Natural Language Feedback. ACL 2020
-
Liat Ein-Dor, Alon Halfon, Ariel Gera, Eyal Shnarch, Lena Dankin, Leshem Choshen, Marina Danilevsky, Ranit Aharonov, Yoav Katz, Noam Slonim. Active Learning for BERT: An Empirical Study. EMNLP 2020
-
Jon Ander Campos, Kyunghyun Cho, Arantxa Otegi, Aitor Soroa, Eneko Agirre, Gorka Azkune. Improving Conversational Question Answering Systems after Deployment using Feedback-Weighted Learning. COLING 2020
-
Natasha Jaques, Judy Hanwen Shen, Asma Ghandeharioun, Craig Ferguson, Agata Lapedriza, Noah Jones, Shixiang Gu, Rosalind W. Picard. Human-centric dialog training via offline reinforcement learning. EMNLP 2020
-
Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul F. Christiano. Learning to summarize from human feedback. CoRR, abs/2009.01325, 2020
-
Ziyu Yao, Yiqi Tang, Wen-tau Yih, Huan Sun, Yu Su. An Imitation Game for Learning Semantic Parsers from User Interaction. EMNLP 2020
-
Bernhard Kratzwald, Stefan Feuerriegel, Huan Sun. Learning a Cost-Effective Annotation Policy for Question Answering. EMNLP 2020
-
Julia Kreutzer, Stefan Riezler. Self-Regulated Interactive Sequence-to-Sequence Learning. ACL 2019
-
Julia Kreutzer, Shahram Khadivi, Evgeny Matusov, Stefan Riezler. Can Neural Machine Translation be Improved with User Feedback? NAACL-HLT 2018 (Industry Papers)
-
Yang Gao, Christian M. Meyer, Iryna Gurevych. APRIL: Interactively Learning to Summarise by Combining Active Preference Learning and Reinforcement Learning. EMNLP 2018
-
Julia Kreutzer, Joshua Uyheng, Stefan Riezler. Reliability and Learnability of Human Bandit Feedback for Sequence-to-Sequence Reinforcement Learning. ACL 2018
-
Carolin Lawrence, Stefan Riezler. Improving a Neural Semantic Parser by Counterfactual Learning from Human Bandit Feedback. ACL 2018
-
Khanh Nguyen, Hal Daumé III, Jordan L. Boyd-Graber. Reinforcement Learning for Bandit Neural Machine Translation with Simulated Human Feedback. EMNLP 2017
-
Artem Sokolov, Julia Kreutzer, Kellen Sunderland, Pavel Danchenko, Witold Szymaniak, Hagen Furstenau, Stefan Riezler. A Shared Task on Bandit Learning for Machine Translation. WMT 2017
-
Carolin Lawrence, Artem Sokolov, Stefan Riezler. Counterfactual Learning from Bandit Feedback under Deterministic Logging: A Case Study in Statistical Machine Translation. EMNLP 2017
-
Artem Sokolov, Julia Kreutzer, Christopher Lo, Stefan Riezler. Learning Structured Predictors from Bandit Feedback for Interactive NLP. ACL 2016
-
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin A. Riedmiller, Andreas Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518, 2015