Awesome LLM Data

Survey

  • On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey, arXiv, 2406.15126, arxiv, pdf, cication: -1

    Lin Long, Rui Wang, Ruixuan Xiao, Junbo Zhao, Xiao Ding, Gang Chen, Haobo Wang

  • On Protecting the Data Privacy of Large Language Models (LLMs): A Survey, arXiv, 2403.05156, arxiv, pdf, cication: -1

    Biwei Yan, Kun Li, Minghui Xu, Yueyan Dong, Yue Zhang, Zhaochun Ren, Xiuzheng Cheng

  • The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources, arXiv, 2406.16746, arxiv, pdf, cication: -1

    Shayne Longpre, Stella Biderman, Alon Albalak, Hailey Schoelkopf, Daniel McDuff, Sayash Kapoor, Kevin Klyman, Kyle Lo, Gabriel Ilharco, Nay San

  • fm-cheatsheet - allenai Star

    Website for hosting the Open Foundation Models Cheat Sheet.

    · (fmcheatsheet)

  • Datasets for Large Language Models: A Comprehensive Survey, arXiv, 2402.18041, arxiv, pdf, cication: -1

    Yang Liu, Jiahuan Cao, Chongyu Liu, Kai Ding, Lianwen Jin

    · (Awesome-LLMs-Datasets - lmmlzn) Star

  • A Survey on Data Selection for Language Models, arXiv, 2402.16827, arxiv, pdf, cication: -1

    Alon Albalak, Yanai Elazar, Sang Michael Xie, Shayne Longpre, Nathan Lambert, Xinyi Wang, Niklas Muennighoff, Bairu Hou, Liangming Pan, Haewon Jeong

  • Data Management For Large Language Models: A Survey, arXiv, 2312.01700, arxiv, pdf, cication: -1

    Zige Wang, Wanjun Zhong, Yufei Wang, Qi Zhu, Fei Mi, Baojun Wang, Lifeng Shang, Xin Jiang, Qun Liu

Techs

  • Data Contamination Report from the 2024 CONDA Shared Task, arXiv, 2407.21530, arxiv, pdf, cication: -1

    Oscar Sainz, Iker García-Ferrero, Alon Jacovi, Jon Ander Campos, Yanai Elazar, Eneko Agirre, Yoav Goldberg, Wei-Lin Chen, Jenny Chim, Leshem Choshen · (conda-workshop.github)

  • DataComp-LM: In search of the next generation of training sets for language models, arXiv, 2406.11794, arxiv, pdf, cication: -1

    Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, Sedrick Keh, Kushal Arora · (datacomp)

    · (dclm - mlfoundations) Star

  • Data Mixing Made Efficient: A Bivariate Scaling Law for Language Model Pretraining, arXiv, 2405.14908, arxiv, pdf, cication: -1

    Ce Ge, Zhijian Ma, Daoyuan Chen, Yaliang Li, Bolin Ding

  • Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach, arXiv, 2405.15613, arxiv, pdf, cication: -1

    Huy V. Vo, Vasil Khalidov, Timothée Darcet, Théo Moutakanni, Nikita Smetanin, Marc Szafraniec, Hugo Touvron, Camille Couprie, Maxime Oquab, Armand Joulin

  • Dynamic data sampler for cross-language transfer learning in large language models, ICASSP 2024 - 2024 IEEE International Conference on Acoustics …, 2024, arxiv, pdf, cication: -1

    Yudong Li, Yuhao Feng, Wen Zhou, Zhe Zhao, Linlin Shen, Cheng Hou, Xianxu Hou

  • Fewer Truncations Improve Language Modeling, arXiv, 2404.10830, arxiv, pdf, cication: -1

    Hantian Ding, Zijian Wang, Giovanni Paolini, Varun Kumar, Anoop Deoras, Dan Roth, Stefano Soatto

  • Best Practices and Lessons Learned on Synthetic Data for Language Models, arXiv, 2404.07503, arxiv, pdf, cication: -1

    Ruibo Liu, Jerry Wei, Fangyu Liu, Chenglei Si, Yanzhe Zhang, Jinmeng Rao, Steven Zheng, Daiyi Peng, Diyi Yang, Denny Zhou

  • Scaling Laws for Data Filtering -- Data Curation cannot be Compute Agnostic, arXiv, 2404.07177, arxiv, pdf, cication: -1

    Sachin Goyal, Pratyush Maini, Zachary C. Lipton, Aditi Raghunathan, J. Zico Kolter · (scaling_laws_data_filtering - locuslab) Star

  • Training LLMs over Neurally Compressed Text, arXiv, 2404.03626, arxiv, pdf, cication: -1

    Brian Lester, Jaehoon Lee, Alex Alemi, Jeffrey Pennington, Adam Roberts, Jascha Sohl-Dickstein, Noah Constant

    • explores training LLMs over text compressed by a neural compressor; the proposed technique (Equal-Info Windows) segments text into blocks that each compress to the same bit length; learning improves with scale, and the approach outperforms byte-level baselines on both perplexity and inference speed benchmarks; latency is reduced thanks to the shorter sequence length.
  • LESS: Selecting Influential Data for Targeted Instruction Tuning, arXiv, 2402.04333, arxiv, pdf, cication: -1

    Mengzhou Xia, Sadhika Malladi, Suchin Gururangan, Sanjeev Arora, Danqi Chen · (cs.princeton)

  • Can LLMs Separate Instructions From Data? And What Do We Even Mean By That?, arXiv, 2403.06833, arxiv, pdf, cication: -1

    Egor Zverev, Sahar Abdelnabi, Mario Fritz, Christoph H. Lampert · (Should-It-Be-Executed-Or-Processed - egozverev) Star

  • Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance, arXiv, 2403.16952, arxiv, pdf, cication: -1

    Jiasheng Ye, Peiju Liu, Tianxiang Sun, Yunhua Zhou, Jun Zhan, Xipeng Qiu · (mixinglaws - yegcjs) Star

  • LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement, arXiv, 2403.15042, arxiv, pdf, cication: -1

    Nicholas Lee, Thanakul Wattanawong, Sehoon Kim, Karttikeya Mangalam, Sheng Shen, Gopala Anumanchipali, Michael W. Mahoney, Kurt Keutzer, Amir Gholami

    • improves the performance of large language models in low-data scenarios by using a teacher model to generate synthetic data from errors made by a student model during initial training
  • Are Human Conversations Special? A Large Language Model Perspective, arXiv, 2403.05045, arxiv, pdf, cication: -1

    Toshish Jawale, Chaitanya Animesh, Sekhar Vallath, Kartik Talamadupula, Larry Heck

  • Learning to Generate Instruction Tuning Datasets for Zero-Shot Task Adaptation, arXiv, 2402.18334, arxiv, pdf, cication: -1

    Nihal V. Nayak, Yiyang Nan, Avi Trost, Stephen H. Bach · (bonito - batsresearch) Star

  • How to Train Data-Efficient LLMs, arXiv, 2402.09668, arxiv, pdf, cication: -1

    Noveen Sachdeva, Benjamin Coleman, Wang-Cheng Kang, Jianmo Ni, Lichan Hong, Ed H. Chi, James Caverlee, Julian McAuley, Derek Zhiyuan Cheng

  • An Initial Exploration of Theoretical Support for Language Model Data Engineering. Part 1: Pretraining

  • Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling, arXiv, 2401.16380, arxiv, pdf, cication: -1

    Pratyush Maini, Skyler Seto, He Bai, David Grangier, Yizhe Zhang, Navdeep Jaitly

  • Genie: Achieving Human Parity in Content-Grounded Datasets Generation, arXiv, 2401.14367, arxiv, pdf, cication: -1

    Asaf Yehudai, Boaz Carmeli, Yosi Mass, Ofir Arviv, Nathaniel Mills, Assaf Toledo, Eyal Shnarch, Leshem Choshen

  • Unitxt: Flexible, Shareable and Reusable Data Preparation and Evaluation for Generative AI, arXiv, 2401.14019, arxiv, pdf, cication: -1

    Elron Bandel, Yotam Perlitz, Elad Venezian, Roni Friedman-Melamed, Ofir Arviv, Matan Orbach, Shachar Don-Yehyia, Dafna Sheinwald, Ariel Gera, Leshem Choshen · (unitxt - IBM) Star

  • The Unreasonable Effectiveness of Easy Training Data for Hard Tasks, arXiv, 2401.06751, arxiv, pdf, cication: -1

    Peter Hase, Mohit Bansal, Peter Clark, Sarah Wiegreffe · (easy-to-hard-generalization - allenai) Star

  • A Shocking Amount of the Web is Machine Translated: Insights from Multi-Way Parallelism, arXiv, 2401.05749, arxiv, pdf, cication: -1

    Brian Thompson, Mehak Preet Dhaliwal, Peter Frisch, Tobias Domhan, Marcello Federico

  • What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning, arXiv, 2312.15685, arxiv, pdf, cication: -1

    Wei Liu, Weihao Zeng, Keqing He, Yong Jiang, Junxian He · (deita - hkust-nlp) Star

  • Order Matters in the Presence of Dataset Imbalance for Multilingual Learning, arXiv, 2312.06134, arxiv, pdf, cication: -1

    Dami Choi, Derrick Xin, Hamid Dadkhahi, Justin Gilmer, Ankush Garg, Orhan Firat, Chih-Kuan Yeh, Andrew M. Dai, Behrooz Ghorbani

  • When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale, arXiv, 2309.04564, arxiv, pdf, cication: 23

    Max Marion, Ahmet Üstün, Luiza Pozzobon, Alex Wang, Marzieh Fadaee, Sara Hooker

  • AlpaGasus: Training A Better Alpaca with Fewer Data, arXiv, 2307.08701, arxiv, pdf, cication: -1

    Lichang Chen, Shiyang Li, Jun Yan, Hai Wang, Kalpa Gunaratna, Vikas Yadav, Zheng Tang, Vijay Srinivasan, Tianyi Zhou, Heng Huang

  • DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining, NeurIPS, 2024, arxiv, pdf, cication: 34

    Sang Michael Xie, Hieu Pham, Xuanyi Dong, Nan Du, Hanxiao Liu, Yifeng Lu, Percy Liang, Quoc V. Le, Tengyu Ma, Adams Wei Yu

  • Scaling Data-Constrained Language Models, arXiv, 2305.16264, arxiv, pdf, cication: -1

    Niklas Muennighoff, Alexander M. Rush, Boaz Barak, Teven Le Scao, Aleksandra Piktus, Nouamane Tazi, Sampo Pyysalo, Thomas Wolf, Colin Raffel

    · (datablations - huggingface) Star
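
    The LLM2LLM entry above describes a loop in which a teacher model generates synthetic data from the examples a student model gets wrong. A minimal sketch of that loop with toy stand-ins follows. The names `llm2llm_style_loop`, `toy_train_student`, and `toy_teacher_augment` are hypothetical and chosen for illustration; they are not the paper's code or API. A real setup would fine-tune a student LLM and prompt a teacher LLM instead of using the arithmetic toys below.

```python
import operator
from collections import Counter
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (question, answer)
OPS = {"+": operator.add, "*": operator.mul}


def parse(question: str) -> Tuple[int, str, int]:
    """Split a toy question such as '3*3' into (3, '*', 3)."""
    for sym in OPS:
        if sym in question:
            left, right = question.split(sym)
            return int(left), sym, int(right)
    raise ValueError(f"unsupported question: {question}")


def toy_train_student(train_set: List[Example]) -> Callable[[str], str]:
    """Toy 'student': it only masters an operator seen at least twice."""
    counts = Counter(parse(q)[1] for q, _ in train_set)
    known = {sym for sym, n in counts.items() if n >= 2}

    def student(question: str) -> str:
        left, sym, right = parse(question)
        return str(OPS[sym](left, right)) if sym in known else "?"

    return student


def toy_teacher_augment(example: Example) -> List[Example]:
    """Toy 'teacher': writes one new example similar to a failed one."""
    left, sym, right = parse(example[0])
    new_left = left + 1
    return [(f"{new_left}{sym}{right}", str(OPS[sym](new_left, right)))]


def llm2llm_style_loop(
    seed: List[Example],
    train_student: Callable[[List[Example]], Callable[[str], str]],
    teacher_augment: Callable[[Example], List[Example]],
    rounds: int = 3,
) -> List[Example]:
    """Grow the training set from the student's errors on the seed data."""
    train_set = list(seed)
    for _ in range(rounds):
        student = train_student(train_set)
        # Errors are collected on the seed examples only, so synthetic
        # examples are never themselves used as augmentation sources.
        wrong = [(q, a) for q, a in seed if student(q) != a]
        if not wrong:
            break
        for example in wrong:
            train_set.extend(teacher_augment(example))
    return train_set


seed_data = [("1+1", "2"), ("2+2", "4"), ("3*3", "9")]
augmented = llm2llm_style_loop(seed_data, toy_train_student, toy_teacher_augment)
print(augmented)
```

    In this toy run the student initially fails only the multiplication example, so the teacher adds one more multiplication example and the loop stops once the student answers every seed example correctly.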

Datasets

Misc

  • CharacterCodex - NousResearch 🤗

  • Zyda: A 1.3T Dataset for Open Language Modeling, arXiv, 2406.01981, arxiv, pdf, cication: -1

    Yury Tokpanov, Beren Millidge, Paolo Glorioso, Jonathan Pilault, Adam Ibrahim, James Whittington, Quentin Anthony · (zyphra)

  • MS MARCO Web Search: a Large-scale Information-rich Web Dataset with Millions of Real Click Labels, arXiv, 2405.07526, arxiv, pdf, cication: -1

    Qi Chen, Xiubo Geng, Corby Rosset, Carolyn Buractaon, Jingwen Lu, Tao Shen, Kun Zhou, Chenyan Xiong, Yeyun Gong, Paul Bennett · (MS-MARCO-Web-Search - microsoft) Star

  • WildChat: 1M ChatGPT Interaction Logs in the Wild, arXiv, 2405.01470, arxiv, pdf, cication: -1

    Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, Yuntian Deng

  • CultureBank: An Online Community-Driven Knowledge Base Towards Culturally Aware Language Technologies, arXiv, 2404.15238, arxiv, pdf, cication: -1

    Weiyan Shi, Ryan Li, Yutong Zhang, Caleb Ziems, Chunhua Yu, Raya Horesh, Rogério Abreu de Paula, Diyi Yang · (culturebank.github)

  • COIG-CQIA: Quality is All You Need for Chinese Instruction Fine-tuning, arXiv, 2403.18058, arxiv, pdf, cication: -1

    Yuelin Bai, Xinrun Du, Yiming Liang, Yonggang Jin, Ziqiang Liu, Junting Zhou, Tianyu Zheng, Xincheng Zhang, Nuo Ma, Zekun Wang · (huggingface) · (COIG-CQIA - paralym) Star

    · (qbitai)

  • 10k_prompts_ranked - DIBT 🤗

  • Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning, arXiv, 2402.06619, arxiv, pdf, cication: 1

    Shivalika Singh, Freddie Vargus, Daniel Dsouza, Börje F. Karlsson, Abinaya Mahendiran, Wei-Yin Ko, Herumb Shandilya, Jay Patel, Deividas Mataciunas, Laura O'Mahony

    · (youtube)

  • Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research, arXiv, 2402.00159, arxiv, pdf, cication: -1

    Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar

  • openhathi_instruct - pacman100 Star

    This repository contains the code for dataset curation and finetuning of instruct variant of the Bilingual OpenHathi model. The resulting model is meant to follow instructions and chat in Hindi and Hinglish.

  • MADLAD-400: A Multilingual And Document-Level Large Audited Dataset, arXiv, 2309.04662, arxiv, pdf, cication: -1

    Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier Garcia, Christopher A. Choquette-Choo, Katherine Lee, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna

    · (google-research - google-research) Star

  • Phi-2: The surprising power of small language models - Microsoft Research

  • What's In My Big Data?, arXiv, 2310.20707, arxiv, pdf, cication: -1

    Yanai Elazar, Akshita Bhagia, Ian Magnusson, Abhilasha Ravichander, Dustin Schwenk, Alane Suhr, Pete Walsh, Dirk Groeneveld, Luca Soldaini, Sameer Singh

  • orca - nuochenpku Star

    Orca: A Few-shot Benchmark for Chinese Conversational Machine Reading Comprehension

  • UltraFeedback - OpenBMB Star

    A large-scale, fine-grained, diverse preference dataset (and models).

  • How Abilities in Large Language Models are Affected by Supervised Fine-tuning Data Composition, arXiv, 2310.05492, arxiv, pdf, cication: -1

    Guanting Dong, Hongyi Yuan, Keming Lu, Chengpeng Li, Mingfeng Xue, Dayiheng Liu, Wei Wang, Zheng Yuan, Chang Zhou, Jingren Zhou

  • LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset, arXiv, 2309.11998, arxiv, pdf, cication: 3

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Tianle Li, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zhuohan Li, Zi Lin, Eric P. Xing

  • SlimPajama-DC: Understanding Data Combinations for LLM Training, arXiv, 2309.10818, arxiv, pdf, cication: -1

    Zhiqiang Shen, Tianhua Tao, Liqun Ma, Willie Neiswanger, Zhengzhong Liu, Hongyi Wang, Bowen Tan, Joel Hestness, Natalia Vassilieva, Daria Soboleva

  • CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages, arXiv, 2309.09400, arxiv, pdf, cication: -1

    Thuat Nguyen, Chien Van Nguyen, Viet Dac Lai, Hieu Man, Nghia Trung Ngo, Franck Dernoncourt, Ryan A. Rossi, Thien Huu Nguyen

  • Textbooks Are All You Need II: phi-1.5 technical report, arXiv, 2309.05463, arxiv, pdf, cication: 9

    Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, Yin Tat Lee

  • The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only, arXiv, 2306.01116, arxiv, pdf, cication: 108

    Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, Julien Launay

  • FunQA: Towards Surprising Video Comprehension, arXiv, 2306.14899, arxiv, pdf, cication: 1

    Binzhu Xie, Sicheng Zhang, Zitang Zhou, Bo Li, Yuanhan Zhang, Jack Hessel, Jingkang Yang, Ziwei Liu · (mp.weixin.qq)

  • The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants, arXiv, 2308.16884, arxiv, pdf, cication: -1

    Lucas Bandarkar, Davis Liang, Benjamin Muller, Mikel Artetxe, Satya Narayan Shukla, Donald Husa, Naman Goyal, Abhinandan Krishnan, Luke Zettlemoyer, Madian Khabsa

  • MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records, arXiv, 2308.14089, arxiv, pdf, cication: 2

    Scott L. Fleming, Alejandro Lozano, William J. Haberkorn, Jenelle A. Jindal, Eduardo P. Reis, Rahul Thapa, Louis Blankemeier, Julian Z. Genkins, Ethan Steinberg, Ashwin Nayak

  • Platypus: Quick, Cheap, and Powerful Refinement of LLMs, arXiv, 2308.07317, arxiv, pdf, cication: 5

    Ariel N. Lee, Cole J. Hunter, Nataniel Ruiz

  • Leveraging Implicit Feedback from Deployment Data in Dialogue, arXiv, 2307.14117, arxiv, pdf, cication: 1

    Richard Yuanzhe Pang, Stephen Roller, Kyunghyun Cho, He He, Jason Weston

  • UltraChat - thunlp Star

    Large-scale, Informative, and Diverse Multi-round Chat Data (and Models)

  • Textbooks Are All You Need, arXiv, 2306.11644, arxiv, pdf, cication: 51

    Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi · (jiqizhixin) · (jiqizhixin)

Multimodal

  • MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine, arXiv, 2408.02900, arxiv, pdf, cication: -1

    Yunfei Xie, Ce Zhou, Lang Gao, Juncheng Wu, Xianhang Li, Hong-Yu Zhou, Sheng Liu, Lei Xing, James Zou, Cihang Xie

  • PathGen-1.6M: 1.6 Million Pathology Image-text Pairs Generation through Multi-agent Collaboration, arXiv, 2407.00203, arxiv, pdf, cication: -1

    Yuxuan Sun, Yunlong Zhang, Yixuan Si, Chenglu Zhu, Zhongyi Shui, Kai Zhang, Jingxiong Li, Xingheng Lyu, Tao Lin, Lin Yang

  • UpVoteWeb - OpenCo7 🤗

  • PIN: A Knowledge-Intensive Dataset for Paired and Interleaved Multimodal Documents, arXiv, 2406.13923, arxiv, pdf, cication: -1

    Junjie Wang, Yin Zhang, Yatai Ji, Yuxiang Zhang, Chunyang Jiang, Yubo Wang, Kang Zhu, Zekun Wang, Tiezhen Wang, Wenhao Huang

  • What If We Recaption Billions of Web Images with LLaMA-3?, arXiv, 2406.08478, arxiv, pdf, cication: -1

    Xianhang Li, Haoqin Tu, Mude Hui, Zeyu Wang, Bingchen Zhao, Junfei Xiao, Sucheng Ren, Jieru Mei, Qing Liu, Huangjie Zheng · (Recap-DataComp-1B - UCSC-VLAA) Star

  • the_cauldron - HuggingFaceM4 🤗

  • Let-It-Wag - bethgelab 🤗

  • MAGID: An Automated Pipeline for Generating Synthetic Multi-modal Datasets, arXiv, 2403.03194, arxiv, pdf, cication: -1

    Hossein Aboutalebi, Hwanjun Song, Yusheng Xie, Arshit Gupta, Justin Sun, Hang Su, Igor Shalyminov, Nikolaos Pappas, Siffi Singh, Saab Mansour

  • Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models, arXiv, 2403.00231, arxiv, pdf, cication: -1

    Lei Li, Yuqi Wang, Runxin Xu, Peiyi Wang, Xiachong Feng, Lingpeng Kong, Qi Liu

  • Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers, arXiv, 2402.19479, arxiv, pdf, cication: -1

    Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang

  • A Touch, Vision, and Language Dataset for Multimodal Alignment, arXiv, 2402.13232, arxiv, pdf, cication: -1

    Letian Fu, Gaurav Datta, Huang Huang, William Chung-Ho Panitch, Jaimyn Drake, Joseph Ortiz, Mustafa Mukadam, Mike Lambeta, Roberto Calandra, Ken Goldberg

  • Let's Go Shopping (LGS) -- Web-Scale Image-Text Dataset for Visual Concept Understanding, arXiv, 2401.04575, arxiv, pdf, cication: -1

    Yatong Bai, Utsav Garg, Apaar Shanker, Haoming Zhang, Samyak Parajuli, Erhan Bas, Isidora Filipovic, Amelia N. Chu, Eugenia D Fomitcheva, Elliot Branson

  • Video Recognition in Portrait Mode, arXiv, 2312.13746, arxiv, pdf, cication: -1

    Mingfei Han, Linjie Yang, Xiaojie Jin, Jiashi Feng, Xiaojun Chang, Heng Wang · (jiqizhixin)

  • OBELICS - HuggingFaceM4 🤗

  • Improving Multimodal Datasets with Image Captioning, arXiv, 2307.10350, arxiv, pdf, cication: 7

    Thao Nguyen, Samir Yitzhak Gadre, Gabriel Ilharco, Sewoong Oh, Ludwig Schmidt

  • InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation, arXiv, 2307.06942, arxiv, pdf, cication: 4

    Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinyuan Chen, Yaohui Wang, Ping Luo, Ziwei Liu

  • Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models, arXiv, 2306.05424, arxiv, pdf, cication: 30

    Muhammad Maaz, Hanoona Rasheed, Salman Khan, Fahad Shahbaz Khan

  • Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks

  • M$^3$IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning, arXiv, 2306.04387, arxiv, pdf, cication: 13

    Lei Li, Yuwei Yin, Shicheng Li, Liang Chen, Peiyi Wang, Shuhuai Ren, Mukai Li, Yazheng Yang, Jingjing Xu, Xu Sun

Reasoning & Action

  • APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets, arXiv, 2406.18518, arxiv, pdf, cication: -1

    Zuxin Liu, Thai Hoang, Jianguo Zhang, Ming Zhu, Tian Lan, Shirley Kokane, Juntao Tan, Weiran Yao, Zhiwei Liu, Yihao Feng

    · (apigen-pipeline.github)

  • json-mode-eval - NousResearch 🤗

  • Can Large Language Models Infer Causation from Correlation?, arXiv, 2306.05836, arxiv, pdf, cication: 11

    Zhijing Jin, Jiarui Liu, Zhiheng Lyu, Spencer Poff, Mrinmaya Sachan, Rada Mihalcea, Mona Diab, Bernhard Schölkopf

  • Mind2Web: Towards a Generalist Agent for the Web, arXiv, 2306.06070, arxiv, pdf, cication: 16

    Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, Yu Su

Alignment

Synthetic

  • SynthVLM: High-Efficiency and High-Quality Synthetic Data for Vision Language Models, arXiv, 2407.20756, arxiv, pdf, cication: -1

    Zheng Liu, Hao Liang, Xijie Huang, Wentao Xiong, Qinhan Yu, Linzhuang Sun, Chong Chen, Conghui He, Bin Cui, Wentao Zhang

  • Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models, arXiv, 2408.04594, arxiv, pdf, cication: -1

    Qirui Jiao, Daoyuan Chen, Yilun Huang, Yaliang Li, Ying Shen · (data-juicer - modelscope) Star

  • magpie-ultra-v0.1 - argilla 🤗

  • AgentInstruct: Toward Generative Teaching with Agentic Flows, arXiv, 2407.03502, arxiv, pdf, cication: -1

    Arindam Mitra, Luciano Del Corro, Guoqing Zheng, Shweti Mahajan, Dany Rouhana, Andres Codas, Yadong Lu, Wei-ge Chen, Olga Vrousgos, Corby Rosset · (x)

  • OpenArena - syv-ai Star

  • Aligning Teacher with Student Preferences for Tailored Training Data Generation, arXiv, 2406.19227, arxiv, pdf, cication: -1

    Yantao Liu, Zhao Zhang, Zijun Yao, Shulin Cao, Lei Hou, Juanzi Li

  • Scaling Synthetic Data Creation with 1,000,000,000 Personas, arXiv, 2406.20094, arxiv, pdf, cication: -1

    Xin Chan, Xiaoyang Wang, Dian Yu, Haitao Mi, Dong Yu

    · (persona-hub - tencent-ailab) Star

  • Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing, arXiv, 2406.08464, arxiv, pdf, cication: -1

    Zhangchen Xu, Fengqing Jiang, Luyao Niu, Yuntian Deng, Radha Poovendran, Yejin Choi, Bill Yuchen Lin · (huggingface)

  • BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation, arXiv, 2405.09546, arxiv, pdf, cication: -1

    Yunhao Ge, Yihe Tang, Jiashu Xu, Cem Gokmen, Chengshu Li, Wensi Ai, Benjamin Jose Martinez, Arman Aydin, Mona Anvari, Ayush K Chakravarthy

  • Genstruct-7B - NousResearch 🤗

  • Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models

    · (cosmopedia - huggingface) Star

  • Gen4Gen: Generative Data Pipeline for Generative Multi-Concept Composition, arXiv, 2402.15504, arxiv, pdf, cication: -1

    Chun-Hsiao Yeh, Ta-Ying Cheng, He-Yen Hsieh, Chuan-En Lin, Yi Ma, Andrew Markham, Niki Trigoni, H. T. Kung, Yubei Chen

  • LexC-Gen: Generating Data for Extremely Low-Resource Languages with Large Language Models and Bilingual Lexicons, arXiv, 2402.14086, arxiv, pdf, cication: -1

    Zheng-Xin Yong, Cristina Menghini, Stephen H. Bach

  • Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models, arXiv, 2402.13064, arxiv, pdf, cication: -1

    Haoran Li, Qingxiu Dong, Zhengyang Tang, Chaojun Wang, Xingxing Zhang, Haoyang Huang, Shaohan Huang, Xiaolong Huang, Zeqiang Huang, Dongdong Zhang

  • ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model, arXiv, 2402.11684, arxiv, pdf, cication: -1

    Guiming Hardy Chen, Shunian Chen, Ruifei Zhang, Junying Chen, Xiangbo Wu, Zhiyi Zhang, Zhihong Chen, Jianquan Li, Xiang Wan, Benyou Wang

  • DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows, arXiv, 2402.10379, arxiv, pdf, cication: -1

    Ajay Patel, Colin Raffel, Chris Callison-Burch · (DataDreamer - datadreamer-dev) Star

  • Synthetic Dialogue Dataset Generation using LLM Agents, arXiv, 2401.17461, arxiv, pdf, cication: -1

    Yelaman Abdullin, Diego Molla-Aliod, Bahadorreza Ofoghi, John Yearwood, Qingyang Li

  • Learning Vision from Models Rivals Learning Vision from Data, arXiv, 2312.17742, arxiv, pdf, cication: -1

    Yonglong Tian, Lijie Fan, Kaifeng Chen, Dina Katabi, Dilip Krishnan, Phillip Isola · (mp.weixin.qq)

  • Ensemble-Instruct: Generating Instruction-Tuning Data with a Heterogeneous Mixture of LMs, arXiv, 2310.13961, arxiv, pdf, cication: -1

    Young-Suk Lee, Md Arafat Sultan, Yousef El-Kurdi, Tahira Naseem, Asim Munawar, Radu Florian, Salim Roukos, Ramón Fernandez Astudillo

  • Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models, arXiv, 2310.13671, arxiv, pdf, cication: -1

    Ruida Wang, Wangchunshu Zhou, Mrinmaya Sachan

  • Ada-Instruct: Adapting Instruction Generators for Complex Reasoning, arXiv, 2310.04484, arxiv, pdf, cication: -1

    Wanyun Cui, Qianle Wang

  • PIPPA: A Partially Synthetic Conversational Dataset, arXiv, 2308.05884, arxiv, pdf, cication: -1

    Tear Gosling, Alpin Dale, Yinhe Zheng

  • Simple synthetic data reduces sycophancy in large language models, arXiv, 2308.03958, arxiv, pdf, cication: 6

    Jerry Wei, Da Huang, Yifeng Lu, Denny Zhou, Quoc V. Le · (qbitai)

  • DialogStudio: Towards Richest and Most Diverse Unified Dataset Collection for Conversational AI, arXiv, 2307.10172, arxiv, pdf, cication: -1

    Jianguo Zhang, Kun Qian, Zhiwei Liu, Shelby Heinecke, Rui Meng, Ye Liu, Zhou Yu, Huan Wang, Silvio Savarese, Caiming Xiong

  • Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias, arXiv, 2306.15895, arxiv, pdf, cication: 10

    Yue Yu, Yuchen Zhuang, Jieyu Zhang, Yu Meng, Alexander Ratner, Ranjay Krishna, Jiaming Shen, Chao Zhang · (attrprompt - yueyu1030) Star

  • GPT Self-Supervision for a Better Data Annotator, arXiv, 2306.04349, arxiv, pdf, cication: 1

    Xiaohuan Pei, Yanxi Li, Chang Xu · (mp.weixin.qq)

  • The Curse of Recursion: Training on Generated Data Makes Models Forget, arXiv, 2305.17493, arxiv, pdf, cication: 3

    Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, Ross Anderson

    · (mp.weixin.qq)

  • Increasing Diversity While Maintaining Accuracy: Text Data Generation with Large Language Models and Human Interventions, arXiv, 2306.04140, arxiv, pdf, cication: 8

    John Joon Young Chung, Ece Kamar, Saleema Amershi

  • Harnessing large-language models to generate private synthetic text, arXiv, 2306.01684, arxiv, pdf, cication: 1

    Alexey Kurakin, Natalia Ponomareva, Umar Syed, Liam MacDermed, Andreas Terzis

  • LIMA: Less Is More for Alignment, arXiv, 2305.11206, arxiv, pdf, cication: 116

    Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu

  • Maybe Only 0.5% Data is Needed: A Preliminary Exploration of Low Training Data Instruction Tuning, arXiv, 2305.09246, arxiv, pdf, cication: 6

    Hao Chen, Yiming Zhang, Qi Zhang, Hantao Yang, Xiaomeng Hu, Xuetao Ma, Yifan Yanggong, Junbo Zhao

Vision

  • The Unmet Promise of Synthetic Training Images: Using Retrieved Real Images Performs Better, arXiv, 2406.05184, arxiv, pdf, cication: -1

    Scott Geng, Cheng-Yu Hsieh, Vivek Ramanujan, Matthew Wallingford, Chun-Liang Li, Pang Wei Koh, Ranjay Krishna · (unmet-promise - scottgeng00) Star

Toolkits

  • quality-classifier-deberta - nvidia 🤗

  • fineweb-edu-classifier - HuggingFaceFW 🤗

  • desbordante-core - Desbordante Star

    Desbordante is a high-performance data profiler that can discover many different patterns in data using various algorithms. It also supports running data-cleaning scenarios with these algorithms. Desbordante has a console version and an easy-to-use web application.

  • databonsai - databonsai Star

    clean & curate your data with LLMs.

  • distilabel - argilla-io Star

    ⚗️ distilabel is a framework for synthetic data and AI feedback for AI engineers that require high-quality outputs, full data ownership, and overall efficiency.

  • Cleaner Pretraining Corpus Curation with Neural Web Scraping, arXiv, 2402.14652, arxiv, pdf, cication: -1

    Zhipeng Xu, Zhenghao Liu, Yukun Yan, Zhiyuan Liu, Chenyan Xiong, Ge Yu · (NeuScraper - OpenMatch) Star

  • dsir - p-lambda Star

    DSIR large-scale data selection framework

  • data-juicer - alibaba Star

    A one-stop data processing system to make data higher-quality, juicier, and more digestible for LLMs! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷

Other

Extra reference

  • Awesome-LLMs-Datasets - lmmlzn Star

    Summarizes existing representative LLM text datasets.

  • awesome-instruction-datasets - jianzhnie Star

    A collection of prompt and instruction datasets (awesome-prompt-datasets, awesome-instruction-dataset) for training chat LLMs such as ChatGPT.