Unlock the Power of LLM: Explore These Datasets to Train Your Own ChatGPT!

JackieQi/awesome-chatgpt-dataset

 
 

awesome-chatgpt-dataset

| Dataset Name | Size | Languages | Source | Cost | License |
| --- | --- | --- | --- | --- | --- |
| cc_sbu_align | 4K | English | MiniGPT-4 dataset | - | BSD 3-Clause License |
| ChatAlpaca | 10K | English | The data currently contain a total of 10,000 conversations with 95,558 utterances. | - | Apache-2.0 license |
| Dolly | 15K | English | databricks-dolly-15k is a corpus of more than 15,000 records generated by thousands of Databricks employees to enable large language models to exhibit the magical interactivity of ChatGPT. | - | CC 3.0 |
| Code Alpaca | 20K | English | Code generation task involving 20,022 samples | - | - |
| HC3 | 37K | English, Chinese | 37,175 instructions generated by ChatGPT and humans | - | - |
| Alpaca Dataset | 52K | English | Generated from 175 seed instructions with the OpenAI API | <$500 | CC By NC 4.0; OpenAI terms of use |
| Alpaca Data Cleaned | 52K | English | Revised version of Alpaca Dataset | - | - |
| Alpaca GPT-4 Data | 52K | English | Generated by GPT-4 using Alpaca prompts | - | - |
| Alpaca GPT-4 Data (Chinese) | 52K | Chinese | Generated by GPT-4 using Chinese prompts translated from Alpaca by ChatGPT | - | - |
| Cabrita Dataset | 52K | Portuguese | Translated from Alpaca Data | - | |
| Japanese Alpaca Dataset | 52K | Japanese | Translated from Alpaca Data by ChatGPT API | $45 | CC By NC 4.0; OpenAI terms of use |
| Traditional Chinese Alpaca Dataset | 52K | Traditional Chinese | Translated from Alpaca Data by ChatGPT API | $40 | Apache-2.0 license |
| Finance | 69K | English | 68,912 finance-related instructions | - | - |
| Vicuna Dataset | 75K | English | ~100k ShareGPT conversations | - | - |
| InstructionTranslation | 80K | Multi-lingual | Translations were generated by M2M 12B, and the output generations were limited to 512 tokens due to the VRAM limit (40G). | - | MIT |
| OASST1 | 89K | Multi-lingual | A human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 messages in 35 different languages, annotated with 461,292 quality ratings, resulting in over 10,000 fully annotated conversation trees. | - | apache-2.0 |
| HH-RLHF | 91K | English | The data are described in the paper: Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. | - | MIT |
| Guanaco Dataset | 98K | English, Simplified Chinese, Traditional Chinese HK & TW, Japanese | 175 tasks from the Alpaca model | $6K | GPLv3 |
| InstructionWild | 104K | English, Chinese | 429 seed instructions, following Alpaca to generate 52K | $880 | Research only; OpenAI terms of use |
| Camel Dataset | 107K | Multi-lingual | Role-playing between AIs (OpenAI API) | - | |
| LLaVA Visual Instruct | 150K | English | LLaVA Visual Instruct 150K is a set of GPT-generated multimodal instruction-following data. It is constructed for visual instruction tuning and for building large multimodal models towards GPT-4 vision/language capability. | - | cc-by-nc-4.0 |
| Prosocial Dialog | 166K | English | 165,681 instructions produced by GPT-3 rewrites of questions and human feedback | - | - |
| Unnatural Instructions | 241K | English | A large dataset of creative and diverse instructions, collected with virtually no human labor. | - | MIT |
| SHP | 385K | English | SHP is a dataset of 385K collective human preferences over responses to questions/instructions in 18 different subject areas, from cooking to legal advice. | - | Reddit non-exclusive, non-transferable, non-sublicensable, and revocable license |
| ultrachat | 404K | English | To ensure generation quality, two separate ChatGPT Turbo APIs are adopted in generation, where one plays the role of the user to generate queries and the other generates the response. | - | cc-by-nc-4.0 |
| ELI5 | 559K | English | The ELI5 dataset is an English-language dataset of questions and answers gathered from three subreddits where users ask factual questions requiring paragraph-length or longer answers. | | |
| GPT4All Dataset | 806K | Multi-lingual | Subset of LAION OIG, StackOverflow Question, BigScience/P3 dataset. Answered by OpenAI API. | - | |
| Instruct | 889K | English | 888,969 English instructions, augmentation using AllenAI NLP tools | - | MIT |
| MOSS | 1M | Chinese | Generated by gpt-3.5-turbo | | Apache-2.0, AGPL-3.0 licenses |
| LaMini-Instruction | 3M | English | A total of 2.58M pairs of instructions and responses using gpt-3.5-turbo based on several existing resources of prompts | - | cc-by-nc-4.0 |
| Natural Instructions | 5M | Multi-lingual | 5,040,134 instructions collected from diverse NLP tasks | - | - |
| BELLE | 10M | Chinese | The 10M Chinese dataset is composed of subsets spanning multiple (instruction) types and multiple fields. | - | Research only; OpenAI terms of use |
| Firefly | 1.6M | Chinese | 1,649,398 Chinese instructions in 23 NLP tasks | - | - |
| OIG-43M Dataset | 43M | Multi-lingual | Together, LAION, and Ontocord.ai. | - | |
| xP3 | 79M | Multi-lingual | 78,883,588 instructions collected by prompts & datasets across 46 languages & 16 NLP tasks | - | - |
| Alpaca-CoT Dataset | - | Multi-lingual | Instruction Data Collection | - | ODC-By |
| stack-exchange-paired | - | English | This dataset contains questions and answers from the Stack Overflow Data Dump for the purpose of preference model training. | - | cc-by-sa-4.0 |
| CodeParrot | - | Python | The database was queried for all Python files less than 1MB in size, resulting in a 180GB dataset with over 20M files. | | |
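
The Size column in the table above uses abbreviations such as 52K and 3M, which makes it awkward to sort or filter programmatically. The sketch below is one way to turn those cells into integers and shortlist datasets above a minimum size. The helper names `parse_size` and `at_least` are hypothetical and not part of this repository, and the sample rows are a small hand-copied subset of the table.

```python
# Hypothetical helpers (not part of this repository) for working with the
# abbreviated Size column, e.g. "52K" or "3M".
_SUFFIXES = {"K": 1_000, "M": 1_000_000}


def parse_size(text):
    """Return the row count for a Size cell, or None when the cell is '-' or empty."""
    text = text.strip().upper()
    if not text or text == "-":
        return None
    if text[-1] in _SUFFIXES:
        # round() guards against floating-point noise for values like "1.5M".
        return round(float(text[:-1]) * _SUFFIXES[text[-1]])
    return int(text)


# A few (name, size, license) rows copied by hand from the table.
ROWS = [
    ("Dolly", "15K", "CC 3.0"),
    ("OASST1", "89K", "apache-2.0"),
    ("HH-RLHF", "91K", "MIT"),
    ("Unnatural Instructions", "241K", "MIT"),
    ("stack-exchange-paired", "-", "cc-by-sa-4.0"),
]


def at_least(rows, minimum):
    """Keep rows with a known size >= minimum, sorted from largest to smallest."""
    sized = [(name, parse_size(size), lic) for name, size, lic in rows]
    kept = [row for row in sized if row[1] is not None and row[1] >= minimum]
    return sorted(kept, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    for name, count, lic in at_least(ROWS, 50_000):
        print(f"{name}: {count:,} ({lic})")
```

Rows whose size is listed as "-" are dropped by the filter rather than treated as zero, since the table does not state a count for them.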
