-
On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey,
arXiv, 2406.15126
, arxiv, pdf, citation: -1
Lin Long, Rui Wang, Ruixuan Xiao, Junbo Zhao, Xiao Ding, Gang Chen, Haobo Wang
-
On Protecting the Data Privacy of Large Language Models (LLMs): A Survey,
arXiv, 2403.05156
, arxiv, pdf, citation: -1
Biwei Yan, Kun Li, Minghui Xu, Yueyan Dong, Yue Zhang, Zhaochun Ren, Xiuzheng Cheng
-
The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources,
arXiv, 2406.16746
, arxiv, pdf, citation: -1
Shayne Longpre, Stella Biderman, Alon Albalak, Hailey Schoelkopf, Daniel McDuff, Sayash Kapoor, Kevin Klyman, Kyle Lo, Gabriel Ilharco, Nay San
-
fm-cheatsheet - allenai
Website for hosting the Open Foundation Models Cheat Sheet.
· (fmcheatsheet)
-
Datasets for Large Language Models: A Comprehensive Survey,
arXiv, 2402.18041
, arxiv, pdf, citation: -1
Yang Liu, Jiahuan Cao, Chongyu Liu, Kai Ding, Lianwen Jin
· (Awesome-LLMs-Datasets - lmmlzn)
-
A Survey on Data Selection for Language Models,
arXiv, 2402.16827
, arxiv, pdf, citation: -1
Alon Albalak, Yanai Elazar, Sang Michael Xie, Shayne Longpre, Nathan Lambert, Xinyi Wang, Niklas Muennighoff, Bairu Hou, Liangming Pan, Haewon Jeong
-
Data Management For Large Language Models: A Survey,
arXiv, 2312.01700
, arxiv, pdf, citation: -1
Zige Wang, Wanjun Zhong, Yufei Wang, Qi Zhu, Fei Mi, Baojun Wang, Lifeng Shang, Xin Jiang, Qun Liu
-
Data Contamination Report from the 2024 CONDA Shared Task,
arXiv, 2407.21530
, arxiv, pdf, citation: -1
Oscar Sainz, Iker García-Ferrero, Alon Jacovi, Jon Ander Campos, Yanai Elazar, Eneko Agirre, Yoav Goldberg, Wei-Lin Chen, Jenny Chim, Leshem Choshen · (conda-workshop.github)
-
DataComp-LM: In search of the next generation of training sets for language models,
arXiv, 2406.11794
, arxiv, pdf, citation: -1
Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, Sedrick Keh, Kushal Arora · (datacomp)
· (dclm - mlfoundations)
-
Data Mixing Made Efficient: A Bivariate Scaling Law for Language Model Pretraining,
arXiv, 2405.14908
, arxiv, pdf, citation: -1
Ce Ge, Zhijian Ma, Daoyuan Chen, Yaliang Li, Bolin Ding
-
Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach,
arXiv, 2405.15613
, arxiv, pdf, citation: -1
Huy V. Vo, Vasil Khalidov, Timothée Darcet, Théo Moutakanni, Nikita Smetanin, Marc Szafraniec, Hugo Touvron, Camille Couprie, Maxime Oquab, Armand Joulin
-
Dynamic data sampler for cross-language transfer learning in large language models,
ICASSP 2024 - 2024 IEEE International Conference on Acoustics …, 2024
, arxiv, pdf, citation: -1
Yudong Li, Yuhao Feng, Wen Zhou, Zhe Zhao, Linlin Shen, Cheng Hou, Xianxu Hou
-
Fewer Truncations Improve Language Modeling,
arXiv, 2404.10830
, arxiv, pdf, citation: -1
Hantian Ding, Zijian Wang, Giovanni Paolini, Varun Kumar, Anoop Deoras, Dan Roth, Stefano Soatto
-
Best Practices and Lessons Learned on Synthetic Data for Language Models,
arXiv, 2404.07503
, arxiv, pdf, citation: -1
Ruibo Liu, Jerry Wei, Fangyu Liu, Chenglei Si, Yanzhe Zhang, Jinmeng Rao, Steven Zheng, Daiyi Peng, Diyi Yang, Denny Zhou
-
Scaling Laws for Data Filtering -- Data Curation cannot be Compute Agnostic,
arXiv, 2404.07177
, arxiv, pdf, citation: -1
Sachin Goyal, Pratyush Maini, Zachary C. Lipton, Aditi Raghunathan, J. Zico Kolter · (scaling_laws_data_filtering - locuslab)
-
Training LLMs over Neurally Compressed Text,
arXiv, 2404.03626
, arxiv, pdf, citation: -1
Brian Lester, Jaehoon Lee, Alex Alemi, Jeffrey Pennington, Adam Roberts, Jascha Sohl-Dickstein, Noah Constant
Explores training LLMs over text compressed by a neural compressor. The proposed technique, Equal-Info Windows, segments text into blocks that each compress to the same bit length. The approach improves with scale and outperforms byte-level baselines on both perplexity and inference-speed benchmarks; latency is reduced thanks to the shorter sequence lengths.
-
LESS: Selecting Influential Data for Targeted Instruction Tuning,
arXiv, 2402.04333
, arxiv, pdf, citation: -1
Mengzhou Xia, Sadhika Malladi, Suchin Gururangan, Sanjeev Arora, Danqi Chen · (cs.princeton)
-
Can LLMs Separate Instructions From Data? And What Do We Even Mean By That?,
arXiv, 2403.06833
, arxiv, pdf, citation: -1
Egor Zverev, Sahar Abdelnabi, Mario Fritz, Christoph H. Lampert · (Should-It-Be-Executed-Or-Processed - egozverev)
-
Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance,
arXiv, 2403.16952
, arxiv, pdf, citation: -1
Jiasheng Ye, Peiju Liu, Tianxiang Sun, Yunhua Zhou, Jun Zhan, Xipeng Qiu · (mixinglaws - yegcjs)
-
LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement,
arXiv, 2403.15042
, arxiv, pdf, citation: -1
Nicholas Lee, Thanakul Wattanawong, Sehoon Kim, Karttikeya Mangalam, Sheng Shen, Gopala Anumanchipalli, Michael W. Mahoney, Kurt Keutzer, Amir Gholami
Improves LLM performance in low-data scenarios: a teacher model generates synthetic training data from the examples a student model gets wrong after fine-tuning on the seed data, and the process is repeated iteratively.
-
Are Human Conversations Special? A Large Language Model Perspective,
arXiv, 2403.05045
, arxiv, pdf, citation: -1
Toshish Jawale, Chaitanya Animesh, Sekhar Vallath, Kartik Talamadupula, Larry Heck
-
Learning to Generate Instruction Tuning Datasets for Zero-Shot Task Adaptation,
arXiv, 2402.18334
, arxiv, pdf, citation: -1
Nihal V. Nayak, Yiyang Nan, Avi Trost, Stephen H. Bach · (bonito - batsresearch)
-
How to Train Data-Efficient LLMs,
arXiv, 2402.09668
, arxiv, pdf, citation: -1
Noveen Sachdeva, Benjamin Coleman, Wang-Cheng Kang, Jianmo Ni, Lichan Hong, Ed H. Chi, James Caverlee, Julian McAuley, Derek Zhiyuan Cheng
-
Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling,
arXiv, 2401.16380
, arxiv, pdf, citation: -1
Pratyush Maini, Skyler Seto, He Bai, David Grangier, Yizhe Zhang, Navdeep Jaitly
-
Genie: Achieving Human Parity in Content-Grounded Datasets Generation,
arXiv, 2401.14367
, arxiv, pdf, citation: -1
Asaf Yehudai, Boaz Carmeli, Yosi Mass, Ofir Arviv, Nathaniel Mills, Assaf Toledo, Eyal Shnarch, Leshem Choshen
-
Unitxt: Flexible, Shareable and Reusable Data Preparation and Evaluation for Generative AI,
arXiv, 2401.14019
, arxiv, pdf, citation: -1
Elron Bandel, Yotam Perlitz, Elad Venezian, Roni Friedman-Melamed, Ofir Arviv, Matan Orbach, Shachar Don-Yehyia, Dafna Sheinwald, Ariel Gera, Leshem Choshen · (unitxt - IBM)
-
The Unreasonable Effectiveness of Easy Training Data for Hard Tasks,
arXiv, 2401.06751
, arxiv, pdf, citation: -1
Peter Hase, Mohit Bansal, Peter Clark, Sarah Wiegreffe · (easy-to-hard-generalization - allenai)
-
A Shocking Amount of the Web is Machine Translated: Insights from Multi-Way Parallelism,
arXiv, 2401.05749
, arxiv, pdf, citation: -1
Brian Thompson, Mehak Preet Dhaliwal, Peter Frisch, Tobias Domhan, Marcello Federico
-
What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning,
arXiv, 2312.15685
, arxiv, pdf, citation: -1
Wei Liu, Weihao Zeng, Keqing He, Yong Jiang, Junxian He · (deita - hkust-nlp)
-
Order Matters in the Presence of Dataset Imbalance for Multilingual Learning,
arXiv, 2312.06134
, arxiv, pdf, citation: -1
Dami Choi, Derrick Xin, Hamid Dadkhahi, Justin Gilmer, Ankush Garg, Orhan Firat, Chih-Kuan Yeh, Andrew M. Dai, Behrooz Ghorbani
-
When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale,
arXiv, 2309.04564
, arxiv, pdf, citation: 23
Max Marion, Ahmet Üstün, Luiza Pozzobon, Alex Wang, Marzieh Fadaee, Sara Hooker
-
AlpaGasus: Training A Better Alpaca with Fewer Data,
arXiv, 2307.08701
, arxiv, pdf, citation: -1
Lichang Chen, Shiyang Li, Jun Yan, Hai Wang, Kalpa Gunaratna, Vikas Yadav, Zheng Tang, Vijay Srinivasan, Tianyi Zhou, Heng Huang
-
DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining,
NeurIPS, 2024
, arxiv, pdf, citation: 34
Sang Michael Xie, Hieu Pham, Xuanyi Dong, Nan Du, Hanxiao Liu, Yifeng Lu, Percy Liang, Quoc V. Le, Tengyu Ma, Adams Wei Yu
-
Scaling Data-Constrained Language Models,
arXiv, 2305.16264
, arxiv, pdf, citation: -1
Niklas Muennighoff, Alexander M. Rush, Boaz Barak, Teven Le Scao, Aleksandra Piktus, Nouamane Tazi, Sampo Pyysalo, Thomas Wolf, Colin Raffel
· (datablations - huggingface)
-
dclm-baseline-1.0 - mlfoundations 🤗
-
smollm-corpus - HuggingFaceTB 🤗
-
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale,
arXiv, 2406.17557
, arxiv, pdf, citation: -1
Guilherme Penedo, Hynek Kydlíček, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, Thomas Wolf
-
fineweb-edu - HuggingFaceFW 🤗
-
blogpost-fineweb-v1 - HuggingFaceFW 🤗
-
RedPajama-Data - togethercomputer
-
fineweb - HuggingFaceFW 🤗
-
idl-wds - pixparse 🤗
-
synthetic_text_to_sql - gretelai 🤗
-
Caselaw_Access_Project - TeraflopAI 🤗
· (twitter)
-
Common Corpus - a PleIAs Collection
· (huggingface)
-
internet_archive_books_en - storytracer 🤗
-
orca-math-word-problems-200k - microsoft 🤗
-
OpenHermesPreferences - argilla 🤗
-
WebSight - HuggingFaceM4 🤗
· (huggingface)
-
oasst2 - OpenAssistant 🤗
-
wikisource - wikimedia 🤗
-
pii-masking-200k - ai4privacy 🤗
-
SlimPajama-627B - cerebras 🤗
· (modelzoo - Cerebras)
-
MADLAD-400 - allenai 🤗
-
peS2o - allenai 🤗
-
CharacterCodex - NousResearch 🤗
-
Zyda: A 1.3T Dataset for Open Language Modeling,
arXiv, 2406.01981
, arxiv, pdf, citation: -1
Yury Tokpanov, Beren Millidge, Paolo Glorioso, Jonathan Pilault, Adam Ibrahim, James Whittington, Quentin Anthony · (zyphra)
-
MS MARCO Web Search: a Large-scale Information-rich Web Dataset with Millions of Real Click Labels,
arXiv, 2405.07526
, arxiv, pdf, citation: -1
Qi Chen, Xiubo Geng, Corby Rosset, Carolyn Buractaon, Jingwen Lu, Tao Shen, Kun Zhou, Chenyan Xiong, Yeyun Gong, Paul Bennett · (MS-MARCO-Web-Search - microsoft)
-
WildChat: 1M ChatGPT Interaction Logs in the Wild,
arXiv, 2405.01470
, arxiv, pdf, citation: -1
Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, Yuntian Deng
-
CultureBank: An Online Community-Driven Knowledge Base Towards Culturally Aware Language Technologies,
arXiv, 2404.15238
, arxiv, pdf, citation: -1
Weiyan Shi, Ryan Li, Yutong Zhang, Caleb Ziems, Chunhua Yu, Raya Horesh, Rogério Abreu de Paula, Diyi Yang · (culturebank.github)
-
COIG-CQIA: Quality is All You Need for Chinese Instruction Fine-tuning,
arXiv, 2403.18058
, arxiv, pdf, citation: -1
Yuelin Bai, Xinrun Du, Yiming Liang, Yonggang Jin, Ziqiang Liu, Junting Zhou, Tianyu Zheng, Xincheng Zhang, Nuo Ma, Zekun Wang · (huggingface) · (COIG-CQIA - paralym)
· (qbitai)
-
10k_prompts_ranked - DIBT 🤗
-
Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning,
arXiv, 2402.06619
, arxiv, pdf, citation: 1
Shivalika Singh, Freddie Vargus, Daniel Dsouza, Börje F. Karlsson, Abinaya Mahendiran, Wei-Yin Ko, Herumb Shandilya, Jay Patel, Deividas Mataciunas, Laura O'Mahony
· (youtube)
-
Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research,
arXiv, 2402.00159
, arxiv, pdf, citation: -1
Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar
-
openhathi_instruct - pacman100
This repository contains the code for dataset curation and fine-tuning of the instruct variant of the bilingual OpenHathi model. The resulting model is meant to follow instructions and chat in Hindi and Hinglish.
-
MADLAD-400: A Multilingual And Document-Level Large Audited Dataset,
arXiv, 2309.04662
, arxiv, pdf, citation: -1
Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier Garcia, Christopher A. Choquette-Choo, Katherine Lee, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna
· (google-research - google-research)
-
Phi-2: The surprising power of small language models - Microsoft Research
-
What's In My Big Data?,
arXiv, 2310.20707
, arxiv, pdf, citation: -1
Yanai Elazar, Akshita Bhagia, Ian Magnusson, Abhilasha Ravichander, Dustin Schwenk, Alane Suhr, Pete Walsh, Dirk Groeneveld, Luca Soldaini, Sameer Singh
-
orca - nuochenpku
Orca: A Few-shot Benchmark for Chinese Conversational Machine Reading Comprehension
-
UltraFeedback - OpenBMB
A large-scale, fine-grained, diverse preference dataset (and models).
-
How Abilities in Large Language Models are Affected by Supervised Fine-tuning Data Composition,
arXiv, 2310.05492
, arxiv, pdf, citation: -1
Guanting Dong, Hongyi Yuan, Keming Lu, Chengpeng Li, Mingfeng Xue, Dayiheng Liu, Wei Wang, Zheng Yuan, Chang Zhou, Jingren Zhou
-
LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset,
arXiv, 2309.11998
, arxiv, pdf, citation: 3
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Tianle Li, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zhuohan Li, Zi Lin, Eric P. Xing
-
SlimPajama-DC: Understanding Data Combinations for LLM Training,
arXiv, 2309.10818
, arxiv, pdf, citation: -1
Zhiqiang Shen, Tianhua Tao, Liqun Ma, Willie Neiswanger, Zhengzhong Liu, Hongyi Wang, Bowen Tan, Joel Hestness, Natalia Vassilieva, Daria Soboleva
-
CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages,
arXiv, 2309.09400
, arxiv, pdf, citation: -1
Thuat Nguyen, Chien Van Nguyen, Viet Dac Lai, Hieu Man, Nghia Trung Ngo, Franck Dernoncourt, Ryan A. Rossi, Thien Huu Nguyen
-
Textbooks Are All You Need II: phi-1.5 technical report,
arXiv, 2309.05463
, arxiv, pdf, citation: 9
Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, Yin Tat Lee
-
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only,
arXiv, 2306.01116
, arxiv, pdf, citation: 108
Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, Julien Launay
-
FunQA: Towards Surprising Video Comprehension,
arXiv, 2306.14899
, arxiv, pdf, citation: 1
Binzhu Xie, Sicheng Zhang, Zitang Zhou, Bo Li, Yuanhan Zhang, Jack Hessel, Jingkang Yang, Ziwei Liu · (mp.weixin.qq)
-
The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants,
arXiv, 2308.16884
, arxiv, pdf, citation: -1
Lucas Bandarkar, Davis Liang, Benjamin Muller, Mikel Artetxe, Satya Narayan Shukla, Donald Husa, Naman Goyal, Abhinandan Krishnan, Luke Zettlemoyer, Madian Khabsa
-
MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records,
arXiv, 2308.14089
, arxiv, pdf, citation: 2
Scott L. Fleming, Alejandro Lozano, William J. Haberkorn, Jenelle A. Jindal, Eduardo P. Reis, Rahul Thapa, Louis Blankemeier, Julian Z. Genkins, Ethan Steinberg, Ashwin Nayak
-
Platypus: Quick, Cheap, and Powerful Refinement of LLMs,
arXiv, 2308.07317
, arxiv, pdf, citation: 5
Ariel N. Lee, Cole J. Hunter, Nataniel Ruiz
-
Leveraging Implicit Feedback from Deployment Data in Dialogue,
arXiv, 2307.14117
, arxiv, pdf, citation: 1
Richard Yuanzhe Pang, Stephen Roller, Kyunghyun Cho, He He, Jason Weston
-
UltraChat - thunlp
Large-scale, Informative, and Diverse Multi-round Chat Data (and Models)
-
Textbooks Are All You Need,
arXiv, 2306.11644
, arxiv, pdf, citation: 51
Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi · (jiqizhixin) · (jiqizhixin)
-
MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine,
arXiv, 2408.02900
, arxiv, pdf, citation: -1
Yunfei Xie, Ce Zhou, Lang Gao, Juncheng Wu, Xianhang Li, Hong-Yu Zhou, Sheng Liu, Lei Xing, James Zou, Cihang Xie
-
PathGen-1.6M: 1.6 Million Pathology Image-text Pairs Generation through Multi-agent Collaboration,
arXiv, 2407.00203
, arxiv, pdf, citation: -1
Yuxuan Sun, Yunlong Zhang, Yixuan Si, Chenglu Zhu, Zhongyi Shui, Kai Zhang, Jingxiong Li, Xingheng Lyu, Tao Lin, Lin Yang
-
UpVoteWeb - OpenCo7 🤗
-
PIN: A Knowledge-Intensive Dataset for Paired and Interleaved Multimodal Documents,
arXiv, 2406.13923
, arxiv, pdf, citation: -1
Junjie Wang, Yin Zhang, Yatai Ji, Yuxiang Zhang, Chunyang Jiang, Yubo Wang, Kang Zhu, Zekun Wang, Tiezhen Wang, Wenhao Huang
-
What If We Recaption Billions of Web Images with LLaMA-3?,
arXiv, 2406.08478
, arxiv, pdf, citation: -1
Xianhang Li, Haoqin Tu, Mude Hui, Zeyu Wang, Bingchen Zhao, Junfei Xiao, Sucheng Ren, Jieru Mei, Qing Liu, Huangjie Zheng · (Recap-DataComp-1B - UCSC-VLAA)
-
the_cauldron - HuggingFaceM4 🤗
-
Let-It-Wag - bethgelab 🤗
-
MAGID: An Automated Pipeline for Generating Synthetic Multi-modal Datasets,
arXiv, 2403.03194
, arxiv, pdf, citation: -1
Hossein Aboutalebi, Hwanjun Song, Yusheng Xie, Arshit Gupta, Justin Sun, Hang Su, Igor Shalyminov, Nikolaos Pappas, Siffi Singh, Saab Mansour
-
Multimodal ArXiv: A Dataset for Improving Scientific Comprehension of Large Vision-Language Models,
arXiv, 2403.00231
, arxiv, pdf, citation: -1
Lei Li, Yuqi Wang, Runxin Xu, Peiyi Wang, Xiachong Feng, Lingpeng Kong, Qi Liu
-
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers,
arXiv, 2402.19479
, arxiv, pdf, citation: -1
Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang
-
A Touch, Vision, and Language Dataset for Multimodal Alignment,
arXiv, 2402.13232
, arxiv, pdf, citation: -1
Letian Fu, Gaurav Datta, Huang Huang, William Chung-Ho Panitch, Jaimyn Drake, Joseph Ortiz, Mustafa Mukadam, Mike Lambeta, Roberto Calandra, Ken Goldberg
-
Let's Go Shopping (LGS) -- Web-Scale Image-Text Dataset for Visual Concept Understanding,
arXiv, 2401.04575
, arxiv, pdf, citation: -1
Yatong Bai, Utsav Garg, Apaar Shanker, Haoming Zhang, Samyak Parajuli, Erhan Bas, Isidora Filipovic, Amelia N. Chu, Eugenia D Fomitcheva, Elliot Branson
-
Video Recognition in Portrait Mode,
arXiv, 2312.13746
, arxiv, pdf, citation: -1
Mingfei Han, Linjie Yang, Xiaojie Jin, Jiashi Feng, Xiaojun Chang, Heng Wang · (jiqizhixin)
-
OBELICS - HuggingFaceM4 🤗
-
Improving Multimodal Datasets with Image Captioning,
arXiv, 2307.10350
, arxiv, pdf, citation: 7
Thao Nguyen, Samir Yitzhak Gadre, Gabriel Ilharco, Sewoong Oh, Ludwig Schmidt
-
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation,
arXiv, 2307.06942
, arxiv, pdf, citation: 4
Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinyuan Chen, Yaohui Wang, Ping Luo, Ziwei Liu
-
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models,
arXiv, 2306.05424
, arxiv, pdf, citation: 30
Muhammad Maaz, Hanoona Rasheed, Salman Khan, Fahad Shahbaz Khan
-
M$^3$IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning,
arXiv, 2306.04387
, arxiv, pdf, citation: 13
Lei Li, Yuwei Yin, Shicheng Li, Liang Chen, Peiyi Wang, Shuhuai Ren, Mukai Li, Yazheng Yang, Jingjing Xu, Xu Sun
-
APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets,
arXiv, 2406.18518
, arxiv, pdf, citation: -1
Zuxin Liu, Thai Hoang, Jianguo Zhang, Ming Zhu, Tian Lan, Shirley Kokane, Juntao Tan, Weiran Yao, Zhiwei Liu, Yihao Feng
-
json-mode-eval - NousResearch 🤗
-
Can Large Language Models Infer Causation from Correlation?,
arXiv, 2306.05836
, arxiv, pdf, citation: 11
Zhijing Jin, Jiarui Liu, Zhiheng Lyu, Spencer Poff, Mrinmaya Sachan, Rada Mihalcea, Mona Diab, Bernhard Schölkopf
-
Mind2Web: Towards a Generalist Agent for the Web,
arXiv, 2306.06070
, arxiv, pdf, citation: 16
Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, Yu Su
-
Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing,
arXiv, 2406.08464
, arxiv, pdf, citation: 2
Zhangchen Xu, Fengqing Jiang, Luyao Niu, Yuntian Deng, Radha Poovendran, Yejin Choi, Bill Yuchen Lin · (huggingface) · (magpie - magpie-align) · (magpie-align.github)
-
moss-002-sft-data - fnlp 🤗
-
tasksource_dpo_pairs - tasksource 🤗
-
Infinity-Instruct - BAAI 🤗
-
Capybara-Preferences - argilla 🤗
-
SynthVLM: High-Efficiency and High-Quality Synthetic Data for Vision Language Models,
arXiv, 2407.20756
, arxiv, pdf, citation: -1
Zheng Liu, Hao Liang, Xijie Huang, Wentao Xiong, Qinhan Yu, Linzhuang Sun, Chong Chen, Conghui He, Bin Cui, Wentao Zhang
-
Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models,
arXiv, 2408.04594
, arxiv, pdf, citation: -1
Qirui Jiao, Daoyuan Chen, Yilun Huang, Yaliang Li, Ying Shen · (data-juicer - modelscope)
-
magpie-ultra-v0.1 - argilla 🤗
-
AgentInstruct: Toward Generative Teaching with Agentic Flows,
arXiv, 2407.03502
, arxiv, pdf, citation: -1
Arindam Mitra, Luciano Del Corro, Guoqing Zheng, Shweti Mahajan, Dany Rouhana, Andres Codas, Yadong Lu, Wei-ge Chen, Olga Vrousgos, Corby Rosset · (x)
-
OpenArena - syv-ai
-
Aligning Teacher with Student Preferences for Tailored Training Data Generation,
arXiv, 2406.19227
, arxiv, pdf, citation: -1
Yantao Liu, Zhao Zhang, Zijun Yao, Shulin Cao, Lei Hou, Juanzi Li
-
Scaling Synthetic Data Creation with 1,000,000,000 Personas,
arXiv, 2406.20094
, arxiv, pdf, citation: -1
Xin Chan, Xiaoyang Wang, Dian Yu, Haitao Mi, Dong Yu
· (persona-hub - tencent-ailab)
-
BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation,
arXiv, 2405.09546
, arxiv, pdf, citation: -1
Yunhao Ge, Yihe Tang, Jiashu Xu, Cem Gokmen, Chengshu Li, Wensi Ai, Benjamin Jose Martinez, Arman Aydin, Mona Anvari, Ayush K Chakravarthy
-
Genstruct-7B - NousResearch 🤗
-
Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models
· (cosmopedia - huggingface)
-
Gen4Gen: Generative Data Pipeline for Generative Multi-Concept Composition,
arXiv, 2402.15504
, arxiv, pdf, citation: -1
Chun-Hsiao Yeh, Ta-Ying Cheng, He-Yen Hsieh, Chuan-En Lin, Yi Ma, Andrew Markham, Niki Trigoni, H. T. Kung, Yubei Chen
-
LexC-Gen: Generating Data for Extremely Low-Resource Languages with Large Language Models and Bilingual Lexicons,
arXiv, 2402.14086
, arxiv, pdf, citation: -1
Zheng-Xin Yong, Cristina Menghini, Stephen H. Bach
-
Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models,
arXiv, 2402.13064
, arxiv, pdf, citation: -1
Haoran Li, Qingxiu Dong, Zhengyang Tang, Chaojun Wang, Xingxing Zhang, Haoyang Huang, Shaohan Huang, Xiaolong Huang, Zeqiang Huang, Dongdong Zhang
-
ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model,
arXiv, 2402.11684
, arxiv, pdf, citation: -1
Guiming Hardy Chen, Shunian Chen, Ruifei Zhang, Junying Chen, Xiangbo Wu, Zhiyi Zhang, Zhihong Chen, Jianquan Li, Xiang Wan, Benyou Wang
-
DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows,
arXiv, 2402.10379
, arxiv, pdf, citation: -1
Ajay Patel, Colin Raffel, Chris Callison-Burch · (DataDreamer - datadreamer-dev)
-
Synthetic Dialogue Dataset Generation using LLM Agents,
arXiv, 2401.17461
, arxiv, pdf, citation: -1
Yelaman Abdullin, Diego Molla-Aliod, Bahadorreza Ofoghi, John Yearwood, Qingyang Li
-
Learning Vision from Models Rivals Learning Vision from Data,
arXiv, 2312.17742
, arxiv, pdf, citation: -1
Yonglong Tian, Lijie Fan, Kaifeng Chen, Dina Katabi, Dilip Krishnan, Phillip Isola · (mp.weixin.qq)
-
Ensemble-Instruct: Generating Instruction-Tuning Data with a Heterogeneous Mixture of LMs,
arXiv, 2310.13961
, arxiv, pdf, citation: -1
Young-Suk Lee, Md Arafat Sultan, Yousef El-Kurdi, Tahira Naseem, Asim Munawar, Radu Florian, Salim Roukos, Ramón Fernandez Astudillo
-
Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models,
arXiv, 2310.13671
, arxiv, pdf, citation: -1
Ruida Wang, Wangchunshu Zhou, Mrinmaya Sachan
-
Ada-Instruct: Adapting Instruction Generators for Complex Reasoning,
arXiv, 2310.04484
, arxiv, pdf, citation: -1
Wanyun Cui, Qianle Wang
-
PIPPA: A Partially Synthetic Conversational Dataset,
arXiv, 2308.05884
, arxiv, pdf, citation: -1
Tear Gosling, Alpin Dale, Yinhe Zheng
-
Simple synthetic data reduces sycophancy in large language models,
arXiv, 2308.03958
, arxiv, pdf, citation: 6
Jerry Wei, Da Huang, Yifeng Lu, Denny Zhou, Quoc V. Le · (qbitai)
-
DialogStudio: Towards Richest and Most Diverse Unified Dataset Collection for Conversational AI,
arXiv, 2307.10172
, arxiv, pdf, citation: -1
Jianguo Zhang, Kun Qian, Zhiwei Liu, Shelby Heinecke, Rui Meng, Ye Liu, Zhou Yu, Huan Wang, Silvio Savarese, Caiming Xiong
-
Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias,
arXiv, 2306.15895
, arxiv, pdf, citation: 10
Yue Yu, Yuchen Zhuang, Jieyu Zhang, Yu Meng, Alexander Ratner, Ranjay Krishna, Jiaming Shen, Chao Zhang · (attrprompt - yueyu1030)
-
GPT Self-Supervision for a Better Data Annotator,
arXiv, 2306.04349
, arxiv, pdf, citation: 1
Xiaohuan Pei, Yanxi Li, Chang Xu · (mp.weixin.qq)
-
The Curse of Recursion: Training on Generated Data Makes Models Forget,
arXiv, 2305.17493
, arxiv, pdf, citation: 3
Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, Ross Anderson
· (mp.weixin.qq)
-
Increasing Diversity While Maintaining Accuracy: Text Data Generation with Large Language Models and Human Interventions,
arXiv, 2306.04140
, arxiv, pdf, citation: 8
John Joon Young Chung, Ece Kamar, Saleema Amershi
-
Harnessing large-language models to generate private synthetic text,
arXiv, 2306.01684
, arxiv, pdf, citation: 1
Alexey Kurakin, Natalia Ponomareva, Umar Syed, Liam MacDermed, Andreas Terzis
-
LIMA: Less Is More for Alignment,
arXiv, 2305.11206
, arxiv, pdf, citation: 116
Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu
-
Maybe Only 0.5% Data is Needed: A Preliminary Exploration of Low Training Data Instruction Tuning,
arXiv, 2305.09246
, arxiv, pdf, citation: 6
Hao Chen, Yiming Zhang, Qi Zhang, Hantao Yang, Xiaomeng Hu, Xuetao Ma, Yifan Yanggong, Junbo Zhao
-
The Unmet Promise of Synthetic Training Images: Using Retrieved Real Images Performs Better,
arXiv, 2406.05184
, arxiv, pdf, citation: -1
Scott Geng, Cheng-Yu Hsieh, Vivek Ramanujan, Matthew Wallingford, Chun-Liang Li, Pang Wei Koh, Ranjay Krishna · (unmet-promise - scottgeng00)
-
quality-classifier-deberta - nvidia 🤗
-
fineweb-edu-classifier - HuggingFaceFW 🤗
-
desbordante-core - Desbordante
Desbordante is a high-performance data profiler capable of discovering many different patterns in data using various algorithms. It also supports running data-cleaning scenarios with these algorithms. Desbordante has a console version and an easy-to-use web application.
-
databonsai - databonsai
clean & curate your data with LLMs.
-
distilabel - argilla-io
⚗️ distilabel is a framework for synthetic data and AI feedback for AI engineers that require high-quality outputs, full data ownership, and overall efficiency.
-
Cleaner Pretraining Corpus Curation with Neural Web Scraping,
arXiv, 2402.14652
, arxiv, pdf, citation: -1
Zhipeng Xu, Zhenghao Liu, Yukun Yan, Zhiyuan Liu, Chenyan Xiong, Ge Yu · (NeuScraper - OpenMatch)
-
dsir - p-lambda
DSIR large-scale data selection framework
-
data-juicer - alibaba
A one-stop data processing system to make data higher-quality, juicier, and more digestible for LLMs! 🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷
-
Ethics and Society Newsletter #6: Building Better AI: The Importance of Data Quality
-
Synthetic data: save money, time and carbon with open source
-
Awesome-LLMs-Datasets - lmmlzn
Summarize existing representative LLMs text datasets.
-
awesome-instruction-datasets - jianzhnie
A collection of awesome-prompt-datasets and awesome-instruction-dataset resources, gathering a wide variety of instruction datasets for training ChatLLM models such as ChatGPT.