Dataset Name | Size | Languages | Source / Description | Cost | License |
---|---|---|---|---|---|
cc_sbu_align | 4K | English | MiniGPT-4 dataset | - | BSD 3-Clause License |
ChatAlpaca | 10K | English | 10,000 multi-turn conversations with 95,558 utterances in total | - | Apache-2.0 license |
Dolly | 15K | English | databricks-dolly-15k: more than 15,000 instruction records written by thousands of Databricks employees | - | CC BY-SA 3.0 |
Code Alpaca | 20K | English | 20,022 code-generation instruction samples | - | - |
HC3 | 37K | English, Chinese | 37,175 questions paired with both human and ChatGPT answers | - | - |
Alpaca Dataset | 52K | English | Generated from 175 seed instructions via the OpenAI API (self-instruct) | <$500 | CC BY-NC 4.0; OpenAI terms of use |
Alpaca Data Cleaned | 52K | English | Revised version of Alpaca Dataset | - | - |
Alpaca GPT-4 Data | 52K | English | Generated by GPT-4 using Alpaca prompts | - | - |
Alpaca GPT-4 Data (Chinese) | 52K | Chinese | Generated by GPT-4 using Chinese prompts translated from Alpaca by ChatGPT | - | - |
Cabrita Dataset | 52K | Portuguese | Translated from Alpaca Data | - | - |
Japanese Alpaca Dataset | 52K | Japanese | Translated from Alpaca Data by the ChatGPT API | $45 | CC BY-NC 4.0; OpenAI terms of use |
Traditional Chinese Alpaca Dataset | 52K | Traditional Chinese | Translated from Alpaca Data by ChatGPT API | $40 | Apache-2.0 license |
Finance | 69K | English | 68,912 finance-related instructions | - | - |
Vicuna Dataset | 75K | English | ~100K user-shared conversations collected from ShareGPT | - | - |
InstructionTranslation | 80K | Multi-lingual | Translations generated by M2M 12B, with outputs capped at 512 tokens due to a 40 GB VRAM limit | - | MIT |
OASST1 | 89K | Multi-lingual | A human-generated, human-annotated assistant-style conversation corpus of 161,443 messages in 35 languages, annotated with 461,292 quality ratings, resulting in over 10,000 fully annotated conversation trees | - | Apache-2.0 |
HH-RLHF | 91K | English | Described in the paper "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback" | - | MIT |
Guanaco Dataset | 98K | English, Simplified Chinese, Traditional Chinese HK & TW, Japanese | Built on the 175 tasks from the Alpaca model, rewritten in multiple languages | $6K | GPLv3 |
InstructionWild | 104K | English, Chinese | 429 seed instructions, following the Alpaca pipeline to generate 52K instructions each in English and Chinese | $880 | Research only; OpenAI terms of use |
Camel Dataset | 107K | Multi-lingual | Role-playing between two AI agents (OpenAI API) | - | - |
LLaVA Visual Instruct | 150K | English | A set of GPT-generated multimodal instruction-following data, constructed for visual instruction tuning and for building large multimodal models with GPT-4-level vision/language capability | - | CC BY-NC 4.0 |
Prosocial Dialog | 166K | English | 165,681 instructions produced from GPT-3-rewritten questions and human feedback | - | - |
Unnatural Instructions | 241K | English | A large dataset of creative and diverse instructions, collected with virtually no human labor | - | MIT |
SHP | 385K | English | 385K collective human preferences over responses to questions/instructions in 18 subject areas, from cooking to legal advice | - | Reddit non-exclusive, non-transferable, non-sublicensable, revocable license |
ultrachat | 404K | English | To ensure generation quality, two separate ChatGPT Turbo APIs are used: one plays the role of the user to generate queries, the other generates the responses | - | CC BY-NC 4.0 |
ELI5 | 559K | English | An English-language dataset of questions and answers gathered from three subreddits where users ask factual questions requiring paragraph-length or longer answers | - | - |
GPT4All Dataset | 806K | Multi-lingual | Subset of LAION OIG, Stack Overflow questions, and the BigScience/P3 dataset; answered via the OpenAI API | - | - |
Instruct | 889K | English | 888,969 English instructions, with augmentation using AllenAI NLP tools | - | MIT |
MOSS | 1M | Chinese | Generated by gpt-3.5-turbo | - | Apache-2.0, AGPL-3.0 licenses |
LaMini-Instruction | 2.58M | English | 2.58M instruction–response pairs generated with gpt-3.5-turbo based on several existing prompt resources | - | CC BY-NC 4.0 |
Natural Instructions | 5M | Multi-lingual | 5,040,134 instructions collected from diverse NLP tasks | - | - |
BELLE | 10M | Chinese | 10M Chinese instructions composed of subsets spanning multiple instruction types and fields | - | Research only; OpenAI terms of use |
Firefly | 1.65M | Chinese | 1,649,398 Chinese instructions across 23 NLP tasks | - | - |
OIG-43M Dataset | 43M | Multi-lingual | Created by Together, LAION, and Ontocord.ai | - | - |
xP3 | 79M | Multi-lingual | 78,883,588 instructions collected by prompts & datasets across 46 languages & 16 NLP tasks | - | - |
Alpaca-CoT Dataset | - | Multi-lingual | A collection of existing instruction datasets | - | ODC-By |
stack-exchange-paired | - | English | Questions and answers from the Stack Exchange Data Dump, processed for preference model training | - | CC BY-SA 4.0 |
CodeParrot | - | Python | All Python files under 1 MB queried from the database, yielding a 180 GB dataset with over 20M files | - | - |
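Many of the instruction-tuning sets above (Alpaca, its GPT-4 and translated variants, Code Alpaca, and others) share a simple instruction/input/output JSON schema. The sketch below shows how such records are typically rendered into training prompts. The records and the Alpaca-style template are illustrative assumptions, not the exact artifacts shipped with any particular dataset:

```python
import json

# Two illustrative records in the instruction/input/output schema used by
# many of the datasets listed above. The "input" field is often empty.
records = json.loads("""[
  {"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"},
  {"instruction": "Name a prime number.", "input": "", "output": "7"}
]""")

# Alpaca-style templates: one for records with an input, one without.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

def to_prompt(rec):
    """Render one record into a training prompt; rec['output'] is the target."""
    template = PROMPT_WITH_INPUT if rec["input"] else PROMPT_NO_INPUT
    return template.format(**rec)

prompts = [to_prompt(r) for r in records]
print(prompts[0])
```

During fine-tuning, the model is trained to continue each rendered prompt with the record's `output` field; datasets that lack an `input` column simply use the shorter template throughout.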
Unlock the Power of LLM: Explore These Datasets to Train Your Own ChatGPT!

Forked from voidful/awesome-chatgpt-dataset.