Beijing Academy of Artificial Intelligence (BAAI)
[Paper][Code][🤗] (would be released soon)
The quality and scale of instruction data are crucial for model performance. Recently, open-source models have increasingly relied on fine-tuning datasets comprising millions of instances, necessitating both high quality and large scale. However, the open-source community has long been constrained by the high costs associated with building such extensive and high-quality instruction fine-tuning datasets, which has limited related research and applications. To address this gap, we are introducing the Infinity Instruct project, aiming to develop a large-scale, high-quality instruction dataset.
- 🔥🔥🔥[2024/06/13] We are pleased to share the results of our intermediate checkpoints. Our ongoing efforts focus on risk assessment and data generation. The finalized version with 10 million instructions is scheduled for release in late June.
Flopsera [http://open.flopsera.com/flopsera-open/details/InfinityInstruct]
huggingface[https://huggingface.co/datasets/BAAI/Infinity-Instruct]
Model | MT-Bench | AlpacaEval2.0 |
---|---|---|
OpenHermes-2.5-Mistral-7B | 7.5 | 16.2 |
Mistral-7B-Instruct-v0.2 | 7.6 | 17.1 |
Llama-3-8B-Instruct* | 8.1 | 22.9 |
GPT 3.5 Turbo 0613 | 8.4 | 22.7 |
Mixtral 8x7B v0.1 | 8.3 | 23.7 |
Gemini Pro | -- | 24.4 |
InfInstruct-3M-Mistral-7B | 7.3 | 14.3 |
InfInstruct-Mistral-7B 0608 | 7.8 | 16.9 |
InfInstruct-Mistral-7B 0612 | 7.9 | 25.1 |
GPT-4-0613 | 9.2 | 30.2 |
Llama-3-70B-Instruct* | 9.0 | 34.4 |
InfInstruct-3M-Llama-3-70B | 8.4 | 21.8 |
InfInstruct-Llama-3-70B 0608 | 8.9 | 27.1 |
InfInstruct-Llama-3-70B 0612 | 8.6 | 30.7 |
InfInstruct-Llama-3-70B 0613 | 8.7 | 31.5 |
*denotes the results come from web
Model | MMLU | GSM8K | HumanEval | HellaSwag | Average |
---|---|---|---|---|---|
Mistral-7B | 56.5 | 48.1 | 14.0 | 35.5 | 38.5 |
Mistral-7B-Instruct-v0.2 | 59.6 | 45.9 | 32.9 | 64.4 | 50.7 |
OpenHermes-2.5-Mistral-7B | 61.7 | 73.0 | 41.5 | 80.6 | 64.2 |
InfInstruct-3M-Mistral-7B | 62.9 | 78.1 | 50.6 | 84.8 | 69.1 |
We collect large-scale instruct data from the open-source community. The data sources are listed as follows:
-
The detailed information on Infinity-Instruct-3M is shown in the following table.
Raw Dataset | Numbers of Rows |
---|---|
glaiveai/glaive-code-assistant-v3 | 138157 |
Replete-AI/code_bagel_hermes-2.5 | 506346 |
m-a-p/CodeFeedback-Filtered-Instruction | 104848 |
bigcode/self-oss-instruct-sc2-exec-filter-50k | 50661 |
codefuse-ai/CodeExercise-Python-27k | 27224 |
nickrosh/Evol-Instruct-Code-80k-v1 | 78264 |
TIGER-Lab/MathInstruct | 188486 |
microsoft/orca-math-word-problems-200k | 200035 |
MetaMathQa | 395000 |
teknium/Openhermes-2.5 | 1001551 |
Math | 320130 |
Selected subjective instructions | 1362000 |
Summary | 4372702 |
- Source and number of subjective instructions:
Raw Dataset | Numbers of Rows |
---|---|
Alpaca GPT4 data | 13490 |
Alpaca GPT4 data zh | 32589 |
Baize | 14906 |
BELLE Generated Chat | 43775 |
BELLE Multiturn Chat | 210685 |
BELLE 3.5M CN | 312598 |
databricks-dolly-15K | 10307 |
LIMA-sft | 712 |
CodeContest | 523 |
LongForm | 3290 |
ShareGPT-Chinese-English-90k | 8919 |
UltraChat | 276345 |
Wizard evol instruct zh | 44738 |
Wizard evol instruct 196K | 88681 |
BELLE School Math | 38329 |
Code Alpaca 20K | 13296 |
WildChat | 61873 |
COIG-CQIA | 45793 |
BAGEL | 55193 |
DEITA | 10000 |
Math | 320130 |
Summary | 1362000 |
The domain distribution of the subjective instruction category are shown in the following picture.
To create an objective ranking, we utilize datasets such as Flan and OpenHermes, with a focus on enhancing code and math capabilities. The method includes detailed topic distribution tagging of the evaluation set (e.g., data structures, sorting in humaneval). We apply heuristic rules to filter out irrelevant data based on the dataset source (e.g., removing network or file I/O operations). We further retrieve a subset from the training set based on the distribution in the validation sets.
We start by collecting high-quality open-source instruction sets. We assign each instruction in the collection a set of tags that describe the abilities and knowledge necessary to complete the instruction. With this tagging system, we can recognize the content distribution of the collection and the abilities required for completing different tasks.
-
Instruction collection: We systematically reviewed available open-source instruction sets and included sets created by humans and advanced LLMs.
-
Tag System: with totally two levels:
- First level tag: Describe the specific knowledge and abilities required for completing each instruction (e.g., Arithmetic Calculation, Knowledge of Biology). The tags are automatically generated by LLM.
- Second level tags: Macro categories such as "Natural Language Processing" and "Math Reasoning." Including 25 categories in total.
Aimed at selecting most informative instructions from the whole collection for enhancing the performance of LLM and improving user experience.
- Informative Instructions:
- Instructions demand multiple kinds of abilities or multiple domains of knowledge. Such instructions are recognized by our tag system.
- Instructions with long-tailed ability or knowledge;
- Instructions with high following difficulty. The following difficulty of instructions is obtained using the method of Li et al. [1].
We expand the seed instructions in directions breadth, depth, difficulty, and complexity with a method built based on [2], and use AI assistants to generate multi-turn data.
- Based on the metadata selected in the previous section, we expand the instructions by randomly selecting one dimension from breadth, depth, difficulty and complexity dimensions on the basis of the Evol-Instruct method.
- Validate the evolved data, and use AI assistants to eliminate data that failed to evolve from the perspective of instruction compliance.
- Use the evolved instructions as the initial input, and use an AI assistant to play different roles to generate 2 to 4 rounds of dialogue for each instruction.
Automatically identifying weaknesses in the model's capabilities to guide the synthesis of data.
- Model performance evaluation System: Constituted by a collection of commonly used evaluation sets;
- Automatic ability deficient diagnosis: Inducing shortcuts based on ground truth answers and model outputs using AI assistants;
- Targeted data synthesis: Automatically generate new instructions using AI assistants based on the induced deficiencies.
The resources, including code, data, and model weights, associated with this project are restricted for academic research purposes only and cannot be used for commercial purposes. The content produced by any version of Infinity Instruct is influenced by uncontrollable variables such as randomness, and therefore, the accuracy of the output cannot be guaranteed by this project. This project does not accept any legal liability for the content of the model output, nor does it assume responsibility for any losses incurred due to the use of associated resources and output results.
[1] Li M, Zhang Y, He S, et al. Superfiltering: Weak-to-strong data filtering for fast instruction-tuning[J]. arXiv preprint arXiv:2402.00530, 2024.
[2] Xu C, Sun Q, Zheng K, et al. WizardLM: Empowering large pre-trained language models to follow complex instructions[C]//The Twelfth International Conference on Learning Representations. 2023.
Our paper, detailing the development and features of the Infinity Instruct dataset, will be released soon on arXiv. Stay tuned!
@article{InfinityInstruct2024,
title={Infinity Instruct},
author={Beijing Academy of Artificial Intelligence (BAAI)},
journal={arXiv preprint arXiv:2406.XXXX},
year={2024}
}