From f02edf726407648d456761cc252a71691ee7b881 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E9=81=93=E8=BE=95?= Date: Sun, 12 Jan 2025 13:55:58 +0800 Subject: [PATCH 1/4] update en homepage for DJ2.0 and DJ-Cookbook --- README.md | 218 +++++++++++++++++++++++++++--------------------------- 1 file changed, 111 insertions(+), 107 deletions(-) diff --git a/README.md b/README.md index 95eba1da2..87c288200 100644 --- a/README.md +++ b/README.md @@ -1,9 +1,9 @@ -[[中文主页]](README_ZH.md) | [[Docs]](#documents) | [[API]](https://modelscope.github.io/data-juicer) | [[DJ-SORA]](docs/DJ_SORA.md) | [[Awesome List]](docs/awesome_llm_data.md) +[[中文主页]](README_ZH.md) | [[Docs]](#documents) | [[OperatorZoo]](docs/Operators.md) | [[API]](https://modelscope.github.io/data-juicer) | [[Awesome LLM Data]](docs/awesome_llm_data.md) -# Data-Juicer: A One-Stop Data Processing System for Large Language Models +# Data Processing for and with Foundation Models - Data-Juicer + Data-Juicer ![](https://img.shields.io/badge/language-Python-214870.svg) ![](https://img.shields.io/badge/license-Apache--2.0-000000.svg) @@ -11,7 +11,7 @@ [![Docker version](https://img.shields.io/docker/v/datajuicer/data-juicer?logo=docker&label=Docker&color=498bdf)](https://hub.docker.com/r/datajuicer/data-juicer) [![DataModality](https://img.shields.io/badge/DataModality-Text,Image,Audio,Video-brightgreen.svg)](docs/DeveloperGuide_ZH.md) -[![Usage](https://img.shields.io/badge/Usage-Cleaning,Generation,Analysis-FFD21E.svg)](docs/DeveloperGuide_ZH.md) +[![Usage](https://img.shields.io/badge/Usage-Cleaning,Synthesis,Analysis-FFD21E.svg)](docs/DeveloperGuide_ZH.md) [![ModelScope- 
Demos](https://img.shields.io/badge/ModelScope-Demos-4e29ff.svg?logo=data:image/svg+xml;base64,PHN2ZyB2aWV3Qm94PSIwIDAgMjI0IDEyMS4zMyIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj4KCTxwYXRoIGQ9Im0wIDQ3Ljg0aDI1LjY1djI1LjY1aC0yNS42NXoiIGZpbGw9IiM2MjRhZmYiIC8+Cgk8cGF0aCBkPSJtOTkuMTQgNzMuNDloMjUuNjV2MjUuNjVoLTI1LjY1eiIgZmlsbD0iIzYyNGFmZiIgLz4KCTxwYXRoIGQ9Im0xNzYuMDkgOTkuMTRoLTI1LjY1djIyLjE5aDQ3Ljg0di00Ny44NGgtMjIuMTl6IiBmaWxsPSIjNjI0YWZmIiAvPgoJPHBhdGggZD0ibTEyNC43OSA0Ny44NGgyNS42NXYyNS42NWgtMjUuNjV6IiBmaWxsPSIjMzZjZmQxIiAvPgoJPHBhdGggZD0ibTAgMjIuMTloMjUuNjV2MjUuNjVoLTI1LjY1eiIgZmlsbD0iIzM2Y2ZkMSIgLz4KCTxwYXRoIGQ9Im0xOTguMjggNDcuODRoMjUuNjV2MjUuNjVoLTI1LjY1eiIgZmlsbD0iIzYyNGFmZiIgLz4KCTxwYXRoIGQ9Im0xOTguMjggMjIuMTloMjUuNjV2MjUuNjVoLTI1LjY1eiIgZmlsbD0iIzM2Y2ZkMSIgLz4KCTxwYXRoIGQ9Im0xNTAuNDQgMHYyMi4xOWgyNS42NXYyNS42NWgyMi4xOXYtNDcuODR6IiBmaWxsPSIjNjI0YWZmIiAvPgoJPHBhdGggZD0ibTczLjQ5IDQ3Ljg0aDI1LjY1djI1LjY1aC0yNS42NXoiIGZpbGw9IiMzNmNmZDEiIC8+Cgk8cGF0aCBkPSJtNDcuODQgMjIuMTloMjUuNjV2LTIyLjE5aC00Ny44NHY0Ny44NGgyMi4xOXoiIGZpbGw9IiM2MjRhZmYiIC8+Cgk8cGF0aCBkPSJtNDcuODQgNzMuNDloLTIyLjE5djQ3Ljg0aDQ3Ljg0di0yMi4xOWgtMjUuNjV6IiBmaWxsPSIjNjI0YWZmIiAvPgo8L3N2Zz4K)](https://modelscope.cn/studios?name=Data-Jiucer&page=1&sort=latest&type=1) [![HuggingFace- Demos](https://img.shields.io/badge/🤗HuggingFace-Demos-4e29ff.svg)](https://huggingface.co/spaces?&search=datajuicer) @@ -19,36 +19,42 @@ [![Document_List](https://img.shields.io/badge/Docs-English-blue?logo=Markdown)](#documents) [![文档列表](https://img.shields.io/badge/文档-中文-blue?logo=Markdown)](README_ZH.md#documents) -[![API Reference](https://img.shields.io/badge/Docs-API_Reference-blue?logo=Markdown)](https://modelscope.github.io/data-juicer/) -[![Paper](http://img.shields.io/badge/cs.LG-arXiv%3A2309.02033-B31B1B?logo=arxiv&logoColor=red)](https://arxiv.org/abs/2309.02033) +[![OpZoo](https://img.shields.io/badge/Doc-OperatorZoo-blue?logo=Markdown)](https://modelscope.github.io/data-juicer/) 
+[![算子池](https://img.shields.io/badge/文档-算子池-blue?logo=Markdown)](https://modelscope.github.io/data-juicer/) +[![Paper](http://img.shields.io/badge/cs.LG-1.0Paper(SIGMOD'24)-B31B1B?logo=arxiv&logoColor=red)](https://arxiv.org/abs/2309.02033) +[![Paper](http://img.shields.io/badge/cs.AI-2.0Paper-B31B1B?logo=arxiv&logoColor=red)](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/data_juicer/DJ2.0_arXiv_preview.pdf) -Data-Juicer is a one-stop **multimodal** data processing system to make data higher-quality, -juicier, and more digestible for LLMs. - +Data-Juicer is a one-stop system to process text and multimodal data for and with foundation models (typically LLMs). We provide a [playground](http://8.138.149.181/) with a managed JupyterLab. [Try Data-Juicer](http://8.138.149.181/) straight away in your browser! If you find Data-Juicer useful for your research or development, please kindly cite our [work](#references). [Platform for AI of Alibaba Cloud (PAI)](https://www.aliyun.com/product/bigdata/learn) has cited our work and integrated Data-Juicer into its data processing products. PAI is an AI Native large model and AIGC engineering platform that provides dataset management, computing power management, model tool chain, model development, model training, model deployment, and AI asset management. For documentation on data processing, please refer to: [PAI-Data Processing for Large Models](https://help.aliyun.com/zh/pai/user-guide/components-related-to-data-processing-for-foundation-models/?spm=a2c4g.11186623.0.0.3e9821a69kWdvX). -Data-Juicer is being actively updated and maintained. We will periodically enhance and add more features, data recipes and datasets. +Data-Juicer is being actively updated and maintained. We will periodically enhance and add more features and data recipes. 
We welcome you to join us (via issues, PRs, [Slack](https://join.slack.com/t/data-juicer/shared_invite/zt-23zxltg9d-Z4d3EJuhZbCLGwtnLWWUDg?spm=a2c22.12281976.0.0.7a8253f30mgpjw) channel, [DingDing](https://qr.dingtalk.com/action/joingroup?spm=a2c22.12281976.0.0.7a8253f30mgpjw&code=v1,k1,C0DI7CwRFrg7gJP5aMC95FUmsNuwuKJboT62BqP5DAk=&_dt_no_comment=1&origin=11) group, ...), in promoting data-model co-development along with research and applications of (multimodal) LLMs! ---- ## News -- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-08-09] We propose Img-Diff, which enhances the performance of multimodal large language models through *contrastive data synthesis*, achieving a score that is 12 points higher than GPT-4V on the [MMVP benchmark](https://tsb0601.github.io/mmvp_blog/). See more details in our [paper](https://arxiv.org/abs/2408.04594), and download the dataset from [huggingface](https://huggingface.co/datasets/datajuicer/Img-Diff) and [modelscope](https://modelscope.cn/datasets/Data-Juicer/Img-Diff). -- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-07-24] "Tianchi Better Synth Data Synthesis Competition for Multimodal Large Models" — Our 4th data-centric LLM competition has kicked off! Please visit the competition's [official website](https://tianchi.aliyun.com/competition/entrance/532251) for more information. -- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-07-17] We utilized the Data-Juicer [Sandbox Laboratory Suite](https://github.com/modelscope/data-juicer/blob/main/docs/Sandbox.md) to systematically optimize data and models through a co-development workflow between data and models, achieving a new top spot on the [VBench](https://huggingface.co/spaces/Vchitect/VBench_Leaderboard) text-to-video leaderboard. 
The related achievements have been compiled and published in a [paper](http://arxiv.org/abs/2407.11784), and the model has been released on the [ModelScope](https://modelscope.cn/models/Data-Juicer/Data-Juicer-T2V) and [HuggingFace](https://huggingface.co/datajuicer/Data-Juicer-T2V) platforms. -- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-07-12] Our *awesome list of MLLM-Data* has evolved into a systemic [survey](https://arxiv.org/abs/2407.08583) from model-data co-development perspective. Welcome to [explore](docs/awesome_llm_data.md) and contribute! -- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-06-01] ModelScope-Sora "Data Directors" creative sprint—Our third data-centric LLM competition has kicked off! Please visit the competition's [official website](https://tianchi.aliyun.com/competition/entrance/532219) for more information. +- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-01-11] We release our 2.0 paper, [Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for Foundation Models](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/data_juicer/DJ2.0_arXiv_preview.pdf). It can now process 70B data samples within 2.1h, using 6400 CPU cores on 50 Ray nodes from an Alibaba Cloud cluster, and deduplicate 5TB data within 2.8h using 1280 CPU cores on 8 Ray nodes. +- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-01-03] We support post-tuning scenarios better, via 20+ related new [OPs](https://github.com/modelscope/data-juicer/releases/tag/v1.0.2), and via unified [dataset format](https://github.com/modelscope/data-juicer/releases/tag/v1.0.3) compatible with LLaMA-Factory and ModelScope-Swift. 
+- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-12-17] We propose *HumanVBench*, which comprises 17 human-centric tasks with synthetic data, benchmarking video-MLLMs' capabilities from views of inner emotion and outer manifestations. See more details in our [paper](https://arxiv.org/abs/2412.17574), and try to [evaluate](https://github.com/modelscope/data-juicer/tree/HumanVBench) your models with it. +- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-11-22] We release DJ [v1.0.0](https://github.com/modelscope/data-juicer/releases/tag/v1.0.0), in which we refactored Data-Juicer's *Operator*, *Dataset*, *Sandbox* and many other modules for better usability, such as supporting fault tolerance, FastAPI, and adaptive resource management. +- [2024-08-25] We give a [tutorial](https://modelscope.github.io/data-juicer/_static/tutorial_kdd24.html) about data processing for multimodal LLMs at KDD'2024. +
History News: > +- [2024-08-09] We propose Img-Diff, which enhances the performance of multimodal large language models through *contrastive data synthesis*, achieving a score that is 12 points higher than GPT-4V on the [MMVP benchmark](https://tsb0601.github.io/mmvp_blog/). See more details in our [paper](https://arxiv.org/abs/2408.04594), and download the dataset from [huggingface](https://huggingface.co/datasets/datajuicer/Img-Diff) and [modelscope](https://modelscope.cn/datasets/Data-Juicer/Img-Diff). +- [2024-07-24] "Tianchi Better Synth Data Synthesis Competition for Multimodal Large Models" — Our 4th data-centric LLM competition has kicked off! Please visit the competition's [official website](https://tianchi.aliyun.com/competition/entrance/532251) for more information. +- [2024-07-17] We utilized the Data-Juicer [Sandbox Laboratory Suite](https://github.com/modelscope/data-juicer/blob/main/docs/Sandbox.md) to systematically optimize data and models through a co-development workflow between data and models, achieving a new top spot on the [VBench](https://huggingface.co/spaces/Vchitect/VBench_Leaderboard) text-to-video leaderboard. The related achievements have been compiled and published in a [paper](http://arxiv.org/abs/2407.11784), and the model has been released on the [ModelScope](https://modelscope.cn/models/Data-Juicer/Data-Juicer-T2V) and [HuggingFace](https://huggingface.co/datajuicer/Data-Juicer-T2V) platforms. +- [2024-07-12] Our *awesome list of MLLM-Data* has evolved into a systemic [survey](https://arxiv.org/abs/2407.08583) from model-data co-development perspective. Welcome to [explore](docs/awesome_llm_data.md) and contribute! +- [2024-06-01] ModelScope-Sora "Data Directors" creative sprint—Our third data-centric LLM competition has kicked off! Please visit the competition's [official website](https://tianchi.aliyun.com/competition/entrance/532219) for more information. 
- [2024-03-07] We release **Data-Juicer [v0.2.0](https://github.com/alibaba/data-juicer/releases/tag/v0.2.0)** now! In this new version, we support more features for **multimodal data (including video now)**, and introduce **[DJ-SORA](docs/DJ_SORA.md)** to provide open large-scale, high-quality datasets for SORA-like models. - [2024-02-20] We have actively maintained an *awesome list of LLM-Data*, welcome to [visit](docs/awesome_llm_data.md) and contribute! @@ -67,83 +73,87 @@ Besides, our paper is also updated to [v3](https://arxiv.org/abs/2309.02033). Table of Contents ================= -- [Data-Juicer: A One-Stop Data Processing System for Large Language Models](#data-juicer--a-one-stop-data-processing-system-for-large-language-models) - - [News](#news) -- [Table of Contents](#table-of-contents) - - [Features](#features) - - [Documentation Index ](#documentation-index-) - - [Demos](#demos) +- [News](#news) +- [Why Data-Juicer?](#why-data-juicer) +- [DJ-Cookbook](#dj-cookbook) + - [Curated Resources](#curated-resources) + - [Coding with DJ](#coding-with-dj) + - [Use Cases \& Data Recipes](#use-cases--data-recipes) + - [Interactive Examples](#interactive-examples) +- [Installation](#installation) - [Prerequisites](#prerequisites) - - [Installation](#installation) - - [From Source](#from-source) - - [Using pip](#using-pip) - - [Using Docker](#using-docker) - - [Installation check](#installation-check) - - [Quick Start](#quick-start) - - [Data Processing](#data-processing) - - [Distributed Data Processing](#distributed-data-processing) - - [Data Analysis](#data-analysis) - - [Data Visualization](#data-visualization) - - [Build Up Config Files](#build-up-config-files) - - [Sandbox](#sandbox) - - [Preprocess Raw Data (Optional)](#preprocess-raw-data-optional) - - [For Docker Users](#for-docker-users) - - [Data Recipes](#data-recipes) - - [License](#license) - - [Contributing](#contributing) - - [Acknowledgement](#acknowledgement) - - [References](#references) - - -## 
Features - -![Overview](https://img.alicdn.com/imgextra/i4/O1CN01WYQP3Z1JHsaXaQDK6_!!6000000001004-0-tps-3640-1812.jpg) + - [From Source](#from-source) + - [Using pip](#using-pip) + - [Using Docker](#using-docker) + - [Installation check](#installation-check) + - [For Video-related Operators](#for-video-related-operators) +- [Quick Start](#quick-start) + - [Data Processing](#data-processing) + - [Distributed Data Processing](#distributed-data-processing) + - [Data Analysis](#data-analysis) + - [Data Visualization](#data-visualization) + - [Build Up Config Files](#build-up-config-files) + - [Sandbox](#sandbox) + - [Preprocess Raw Data (Optional)](#preprocess-raw-data-optional) + - [For Docker Users](#for-docker-users) +- [License](#license) +- [Contributing](#contributing) +- [Acknowledgement](#acknowledgement) +- [References](#references) + + +## Why Data-Juicer? + +![Overview](https://img.alicdn.com/imgextra/i4/O1CN01uawwRu1JMSdafy5lF_!!6000000001014-2-tps-4034-4146.png) - **Systematic & Reusable**: - Empowering users with a systematic library of 80+ core [OPs](docs/Operators.md), 20+ reusable [config recipes](configs), and 20+ feature-rich + Empowering users with a systematic library of 100+ core [OPs](docs/Operators.md), and 50+ reusable [config recipes](configs) and dedicated [toolkits](#documentation), designed to - function independently of specific multimodal LLM datasets and processing pipelines. - -- **Data-in-the-loop & Sandbox**: Supporting one-stop data-model collaborative development, enabling rapid iteration - through the [sandbox laboratory](docs/Sandbox.md), and providing features such as feedback loops based on data and model, - visualization, and multidimensional automatic evaluation, so that you can better understand and improve your data and models. 
- ![Data-in-the-loop](https://img.alicdn.com/imgextra/i2/O1CN017U7Zz31Y7XtCJ5GOz_!!6000000003012-0-tps-3640-1567.jpg) - -- **Towards production environment**: Providing efficient and parallel data processing pipelines (Aliyun-PAI\Ray\Slurm\CUDA\OP Fusion) - requiring less memory and CPU usage, optimized with automatic fault-toleration. - ![sys-perf](https://img.alicdn.com/imgextra/i4/O1CN01Sk0q2U1hdRxbnQXFg_!!6000000004300-0-tps-2438-709.jpg) + function independently of specific multimodal LLM datasets and processing pipelines. Supporting data analysis, cleaning, and synthesis in pre-training, post-tuning, en, zh, and more scenarios. -- **Comprehensive Data Processing Recipes**: Offering tens of [pre-built data - processing recipes](configs/data_juicer_recipes/README.md) for pre-training, fine-tuning, en, zh, and more scenarios. Validated on - reference LLaMA and LLaVA models. - ![exp_llama](https://img.alicdn.com/imgextra/i2/O1CN019WtUPP1uhebnDlPR8_!!6000000006069-2-tps-2530-1005.png) +- **User-Friendly & Extensible**: + Designed for simplicity and flexibility, with easy-start [guides](#quick-start), and [DJ-Cookbook](#dj-cookbook) containing fruitful demo usages. Feel free to [implement your own OPs](docs/DeveloperGuide.md#build-your-own-ops) for customizable data processing. -- **Flexible & Extensible**: Accommodating most types of data formats (e.g., jsonl, parquet, csv, ...) and allowing flexible combinations of OPs. Feel free to [implement your own OPs](docs/DeveloperGuide.md#build-your-own-ops) for customizable data processing. +- **Efficient & Robust**: Providing performance-optimized [parallel data processing](docs/distributed) (Aliyun-PAI\Ray\CUDA\OP Fusion), + faster with less resource usage, verified in large-scale production environments. 
-- **User-Friendly Experience**: Designed for simplicity, with [comprehensive documentation](#documents), [easy start guides](#quick-start) and [demo configs](configs/README.md), and intuitive configuration with simple adding/removing OPs from [existing configs](configs/config_all.yaml). +- **Effect-Proven & Sandbox**: Supporting data-model co-development, enabling rapid iteration + through the [sandbox laboratory](docs/Sandbox.md), and providing features such as feedback loops and visualization, so that you can better understand and improve your data and models. Many effect-proven datasets and models have been derived from DJ, in scenarios such as pre-training, text-to-video and image-to-text generation. + ![Data-in-the-loop](https://img.alicdn.com/imgextra/i2/O1CN017U7Zz31Y7XtCJ5GOz_!!6000000003012-0-tps-3640-1567.jpg) -## Documentation Index +## DJ-Cookbook +### Curated Resources +- [KDD-Tutorial](https://modelscope.github.io/data-juicer/_static/tutorial_kdd24.html) +- [Awesome LLM-Data](docs/awesome_llm_data.md) +- ["Bad" Data Exhibition](docs/BadDataExhibition.md) -- [Overview](README.md) +### Coding with DJ +- [Overview of DJ](README.md) - [Operator Zoo](docs/Operators.md) -- [Configs](configs/README.md) +- [Quick Start](docs/QuickStart.md) +- [Configuration](configs/README.md) - [Developer Guide](docs/DeveloperGuide.md) - [API references](https://modelscope.github.io/data-juicer/) -- [KDD-Tutorial](https://modelscope.github.io/data-juicer/_static/tutorial_kdd24.html) -- ["Bad" Data Exhibition](docs/BadDataExhibition.md) -- [Awesome LLM-Data](docs/awesome_llm_data.md) -- Dedicated Toolkits - - [Quality Classifier](tools/quality_classifier/README.md) - - [Auto Evaluation](tools/evaluator/README.md) - - [Preprocess](tools/preprocess/README.md) - - [Postprocess](tools/postprocess/README.md) +- [Preprocess Tools](tools/preprocess/README.md) +- [Postprocess Tools](tools/postprocess/README.md) +- [Format Conversion](tools/fmt_conversion/README.md) +- 
[Sandbox](docs/Sandbox.md) +- [Quality Classifier](tools/quality_classifier/README.md) +- [Auto Evaluation](tools/evaluator/README.md) +- [Third-parties Integration](thirdparty/README.md) + +### Use Cases & Data Recipes +- [Recipes for data process in BLOOM](configs/reproduced_bloom/README.md) +- [Recipes for data process in RedPajama](configs/redpajama/README.md) +- [Refined recipes for pre-training text data](configs/data_juicer_recipes/README.md) +- [Refined recipes for fine-tuning text data](configs/data_juicer_recipes/README.md#before-and-after-refining-for-alpaca-cot-dataset) +- [Refined recipes for pre-training multi-modal data](configs/data_juicer_recipes/README.md#before-and-after-refining-for-multimodal-dataset) - [DJ-SORA](docs/DJ_SORA.md) -- [Third-parties (LLM Ecosystems)](thirdparty/README.md) -## Demos +### Interactive Examples - Introduction to Data-Juicer [[ModelScope](https://modelscope.cn/studios/Data-Juicer/overview_scan/summary)] [[HuggingFace](https://huggingface.co/spaces/datajuicer/overview_scan)] - Data Visualization: - Basic Statistics [[ModelScope](https://modelscope.cn/studios/Data-Juicer/data_visulization_statistics/summary)] [[HuggingFace](https://huggingface.co/spaces/datajuicer/data_visualization_statistics)] @@ -161,13 +171,13 @@ Table of Contents - Data Sampling and Mixture [[ModelScope](https://modelscope.cn/studios/Data-Juicer/data_mixture/summary)] [[HuggingFace](https://huggingface.co/spaces/datajuicer/data_mixture)] - Data Processing Loop [[ModelScope](https://modelscope.cn/studios/Data-Juicer/data_process_loop/summary)] [[HuggingFace](https://huggingface.co/spaces/datajuicer/data_process_loop)] -## Prerequisites +## Installation + +### Prerequisites - Recommend Python>=3.9,<=3.10 - gcc >= 5 (at least C++14 support) -## Installation - ### From Source - Run the following commands to install the latest basic `data_juicer` version in @@ -181,8 +191,8 @@ pip install -v -e . ```shell cd -pip install -v -e . 
# install a minimal dependencies, which support the basic functions -pip install -v -e .[tools] # install a subset of tools dependencies +pip install -v -e . # Install minimal dependencies, which support the basic functions +pip install -v -e .[tools] # Install a subset of tools dependencies ``` The dependency options are listed below: @@ -199,7 +209,7 @@ The dependency options are listed below: - Install dependencies for specific OPs -With the growth of the number of OPs, the dependencies of all OPs becomes very heavy. Instead of using the command `pip install -v -e .[sci]` to install all dependencies, +With the growth of the number of OPs, the dependencies of all OPs become very heavy. Instead of using the command `pip install -v -e .[sci]` to install all dependencies, we provide two alternative, lighter options: - Automatic Minimal Dependency Installation: During the execution of Data-Juicer, minimal dependencies will be automatically installed. This allows for immediate execution, but may potentially lead to dependency conflicts. @@ -243,7 +253,7 @@ pip install py-data-juicer docker build -t datajuicer/data-juicer: . ``` - - The format of `` is like `v0.2.0`, which is the same as release version tag. + - The format of `` is like `v0.2.0`, which is the same as the release version tag. ### Installation check @@ -279,10 +289,10 @@ python tools/process_data.py --config configs/demo/process.yaml dj-process --config configs/demo/process.yaml ``` -- **Note:** For some operators that involve third-party models or resources which are not stored locally on your computer, it might be slow for the first running because these ops need to download corresponding resources into a directory first. +- **Note:** For some operators that involve third-party models or resources that are not stored locally on your computer, it might be slow for the first running because these ops need to download corresponding resources into a directory first. 
The default download cache directory is `~/.cache/data_juicer`. Change the cache location by setting the shell environment variable, `DATA_JUICER_CACHE_HOME` to another directory, and you can also change `DATA_JUICER_MODELS_CACHE` or `DATA_JUICER_ASSETS_CACHE` in the same way: -- **Note:** When using operators with third-party models, it's necessary to declare the corresponding `mem_required` in the configuration file (you can refer to the settings in the `config_all.yaml` file). During runtime, Data-Juicer will control the number of processes based on memory availability and the memory requirements of the operator models to achieve better data processing efficiency. When running with CUDA environment, if the mem_required for an operator is not declared correctly, it could potentially lead to a CUDA Out of Memory issue. +- **Note:** When using operators with third-party models, it's necessary to declare the corresponding `mem_required` in the configuration file (you can refer to the settings in the `config_all.yaml` file). During runtime, Data-Juicer will control the number of processes based on memory availability and the memory requirements of the operator models to achieve better data processing efficiency. When running with CUDA environments, if the mem_required for an operator is not declared correctly, it could potentially lead to a CUDA Out of Memory issue. ```shell # cache home @@ -293,7 +303,7 @@ export DATA_JUICER_MODELS_CACHE="/path/to/another/directory/models" export DATA_JUICER_ASSETS_CACHE="/path/to/another/directory/assets" ``` -#### Flexible Programming Interface +- **Flexible Programming Interface:** We provide various simple interfaces for users to choose from as follows. ```python #... init op & dataset ... @@ -319,7 +329,8 @@ python tools/process_data.py --config ./demos/process_video_on_ray/configs/demo. 
``` - To run data processing across multiple machines, it is necessary to ensure that all distributed nodes can access the corresponding data paths (for example, by mounting the respective data paths on a file-sharing system such as NAS). -- The deduplicator operators for RAY mode are different from the single-machine version, and all those operators are prefixed with `ray`, e.g. `ray_video_deduplicator` and `ray_document_deduplicator`. Those operators also rely on a [Redis](https://redis.io/) instance. So in addition to starting the RAY cluster, you also need to setup your Redis instance in advance and provide `host` and `port` of your Redis instance in configuration. +- The deduplication operators for RAY mode are different from the single-machine version, and all those operators are prefixed with `ray`, e.g. `ray_video_deduplicator` and `ray_document_deduplicator`. +- More details can be found in the doc for [distributed processing](./docs/distributed_processing.md). > Users can also opt not to use RAY and instead split the dataset to run on a cluster with [Slurm](https://slurm.schedmd.com/). In this case, please use the default Data-Juicer without RAY. > [Aliyun PAI-DLC](https://www.aliyun.com/activity/bigdata/pai-dlc) supports the RAY framework, Slurm framework, etc. Users can directly create RAY jobs and Slurm jobs on the DLC cluster. @@ -340,7 +351,7 @@ dj-analyze --config configs/demo/analyzer.yaml dj-analyze --auto --dataset_path xx.jsonl [--auto_num 1000] ``` -- **Note:** Analyzer only compute stats for Filters that produce stats or other OPs that produce tags/categories in meta. So other OPs will be ignored in the analysis process. We use the following registries to decorate OPs: +- **Note:** Analyzer only computes stats for Filters that produce stats or other OPs that produce tags/categories in meta. So other OPs will be ignored in the analysis process. 
We use the following registries to decorate OPs: - `NON_STATS_FILTERS`: decorate Filters that **DO NOT** produce any stats. - `TAGGING_OPS`: decorate OPs that **DO** produce tags/categories in meta field. @@ -390,13 +401,13 @@ python tools/sandbox_starter.py --config configs/demo/sandbox/sandbox.yaml ``` ### Preprocess Raw Data (Optional) -- Our formatters support some common input dataset formats for now: +- Our Formatters support some common input dataset formats for now: - Multi-sample in one file: jsonl/json, parquet, csv/tsv, etc. - Single-sample in one file: txt, code, docx, pdf, etc. - However, data from different sources are complicated and diverse. Such as: - [Raw arXiv data downloaded from S3](https://info.arxiv.org/help/bulk_data_s3.html) include thousands of tar files and even more gzip files in them, and expected tex files are embedded in the gzip files so they are hard to obtain directly. - Some crawled data include different kinds of files (pdf, html, docx, etc.). And extra information like tables, charts, and so on is hard to extract. -- It's impossible to handle all kinds of data in Data-Juicer, issues/PRs are welcome to contribute to process new data types! +- It's impossible to handle all kinds of data in Data-Juicer, issues/PRs are welcome to contribute to processing new data types! - Thus, we provide some **common preprocessing tools** in [`tools/preprocess`](tools/preprocess/) for you to preprocess these data. - You are welcome to make your contributions to new preprocessing tools for the community. - We **highly recommend** that complicated data can be preprocessed to jsonl or parquet files. @@ -442,13 +453,6 @@ docker exec -it bash

🔼 back to index

-## Data Recipes -- [Recipes for data process in BLOOM](configs/reproduced_bloom/README.md) -- [Recipes for data process in RedPajama](configs/redpajama/README.md) -- [Refined recipes for pre-training text data](configs/data_juicer_recipes/README.md) -- [Refined recipes for fine-tuning text data](configs/data_juicer_recipes/README.md#before-and-after-refining-for-alpaca-cot-dataset) -- [Refined recipes for pre-training multi-modal data](configs/data_juicer_recipes/README.md#before-and-after-refining-for-multimodal-dataset) - ## License @@ -456,20 +460,15 @@ Data-Juicer is released under Apache License 2.0. ## Contributing We are in a rapidly developing field and greatly welcome contributions of new -features, bug fixes and better documentations. Please refer to +features, bug fixes, and better documentation. Please refer to [How-to Guide for Developers](docs/DeveloperGuide.md). -If you have any questions, please join our [discussion groups](README.md). - ## Acknowledgement -Data-Juicer is used across various LLM products and research initiatives, -including industrial LLMs from Alibaba Cloud's Tongyi, such as Dianjin for -financial analysis, and Zhiwen for reading assistant, as well as the Alibaba -Cloud's platform for AI (PAI). -We look forward to more of your experience, suggestions and discussions for collaboration! +Data-Juicer is used across various foundation model applications and research initiatives, such as industrial scenarios in Alibaba Tongyi and Alibaba Cloud's platform for AI (PAI). +We look forward to more of your experience, suggestions, and discussions for collaboration! 
-Data-Juicer thanks and refers to several community projects, such as -[Huggingface-Datasets](https://github.com/huggingface/datasets), [Bloom](https://huggingface.co/bigscience/bloom), [RedPajama](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1), [Pile](https://huggingface.co/datasets/EleutherAI/pile), [Alpaca-Cot](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT), [Megatron-LM](https://github.com/NVIDIA/Megatron-LM), [DeepSpeed](https://www.deepspeed.ai/), [Arrow](https://github.com/apache/arrow), [Ray](https://github.com/ray-project/ray), [Beam](https://github.com/apache/beam), [LM-Harness](https://github.com/EleutherAI/lm-evaluation-harness), [HELM](https://github.com/stanford-crfm/helm), .... +Data-Juicer thanks many community [contributors](https://github.com/modelscope/data-juicer/graphs/contributors) and open-source projects, such as +[Huggingface-Datasets](https://github.com/huggingface/datasets), [Bloom](https://huggingface.co/bigscience/bloom), [RedPajama](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1), [Arrow](https://github.com/apache/arrow), [Ray](https://github.com/ray-project/ray), .... @@ -485,15 +484,19 @@ If you find our work useful for your research or development, please kindly cite ```
- More related papers from Data-Juicer Team: + More related papers from the Data-Juicer Team: > +- [Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for Foundation Models](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/data_juicer/DJ2.0_arXiv_preview.pdf) + - [Data-Juicer Sandbox: A Comprehensive Suite for Multimodal Data-Model Co-development](https://arxiv.org/abs/2407.11784) - [The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective](https://arxiv.org/abs/2407.08583) - [ImgDiff: Contrastive Data Synthesis for Vision Large Language Models](https://arxiv.org/abs/2408.04594) +- [HumanVBench: Exploring Human-Centric Video Understanding Capabilities of MLLMs with Synthetic Benchmark Data](https://arxiv.org/abs/2412.17574) + - [Data Mixing Made Efficient: A Bivariate Scaling Law for Language Model Pretraining](https://arxiv.org/abs/2405.14908)
@@ -501,3 +504,4 @@ If you find our work useful for your research or development, please kindly cite

🔼 back to index

+ From 4287a7c93e57e0e44e9e78c08eddb3da74027b0d Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E9=81=93=E8=BE=95?= Date: Sun, 12 Jan 2025 14:21:03 +0800 Subject: [PATCH 2/4] Fix bad links --- README.md | 25 ++++++++++++------------- 1 file changed, 12 insertions(+), 13 deletions(-) diff --git a/README.md b/README.md index 4e7671750..f32f08fac 100644 --- a/README.md +++ b/README.md @@ -1,4 +1,4 @@ -[[中文主页]](README_ZH.md) | [[Docs]](#documents) | [[OperatorZoo]](docs/Operators.md) | [[API]](https://modelscope.github.io/data-juicer) | [[Awesome LLM Data]](docs/awesome_llm_data.md) +[[中文主页]](README_ZH.md) | [[DJ-Cookbook]](#dj-cookbook) | [[OperatorZoo]](docs/Operators.md) | [[API]](https://modelscope.github.io/data-juicer) | [[Awesome LLM Data]](docs/awesome_llm_data.md) # Data Processing for and with Foundation Models @@ -10,17 +10,17 @@ [![pypi version](https://img.shields.io/pypi/v/py-data-juicer?logo=pypi&color=026cad)](https://pypi.org/project/py-data-juicer) [![Docker version](https://img.shields.io/docker/v/datajuicer/data-juicer?logo=docker&label=Docker&color=498bdf)](https://hub.docker.com/r/datajuicer/data-juicer) -[![DataModality](https://img.shields.io/badge/DataModality-Text,Image,Audio,Video-brightgreen.svg)](docs/DeveloperGuide_ZH.md) -[![Usage](https://img.shields.io/badge/Usage-Cleaning,Synthesis,Analysis-FFD21E.svg)](docs/DeveloperGuide_ZH.md) +[![DataModality](https://img.shields.io/badge/DataModality-Text,Image,Audio,Video-brightgreen.svg)](#dj-cookbook) +[![Usage](https://img.shields.io/badge/Usage-Cleaning,Synthesis,Analysis-FFD21E.svg)](#dj-cookbook) [![ModelScope- 
Demos](https://img.shields.io/badge/ModelScope-Demos-4e29ff.svg?logo=data:image/svg+xml;base64,PHN2ZyB2aWV3Qm94PSIwIDAgMjI0IDEyMS4zMyIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj4KCTxwYXRoIGQ9Im0wIDQ3Ljg0aDI1LjY1djI1LjY1aC0yNS42NXoiIGZpbGw9IiM2MjRhZmYiIC8+Cgk8cGF0aCBkPSJtOTkuMTQgNzMuNDloMjUuNjV2MjUuNjVoLTI1LjY1eiIgZmlsbD0iIzYyNGFmZiIgLz4KCTxwYXRoIGQ9Im0xNzYuMDkgOTkuMTRoLTI1LjY1djIyLjE5aDQ3Ljg0di00Ny44NGgtMjIuMTl6IiBmaWxsPSIjNjI0YWZmIiAvPgoJPHBhdGggZD0ibTEyNC43OSA0Ny44NGgyNS42NXYyNS42NWgtMjUuNjV6IiBmaWxsPSIjMzZjZmQxIiAvPgoJPHBhdGggZD0ibTAgMjIuMTloMjUuNjV2MjUuNjVoLTI1LjY1eiIgZmlsbD0iIzM2Y2ZkMSIgLz4KCTxwYXRoIGQ9Im0xOTguMjggNDcuODRoMjUuNjV2MjUuNjVoLTI1LjY1eiIgZmlsbD0iIzYyNGFmZiIgLz4KCTxwYXRoIGQ9Im0xOTguMjggMjIuMTloMjUuNjV2MjUuNjVoLTI1LjY1eiIgZmlsbD0iIzM2Y2ZkMSIgLz4KCTxwYXRoIGQ9Im0xNTAuNDQgMHYyMi4xOWgyNS42NXYyNS42NWgyMi4xOXYtNDcuODR6IiBmaWxsPSIjNjI0YWZmIiAvPgoJPHBhdGggZD0ibTczLjQ5IDQ3Ljg0aDI1LjY1djI1LjY1aC0yNS42NXoiIGZpbGw9IiMzNmNmZDEiIC8+Cgk8cGF0aCBkPSJtNDcuODQgMjIuMTloMjUuNjV2LTIyLjE5aC00Ny44NHY0Ny44NGgyMi4xOXoiIGZpbGw9IiM2MjRhZmYiIC8+Cgk8cGF0aCBkPSJtNDcuODQgNzMuNDloLTIyLjE5djQ3Ljg0aDQ3Ljg0di0yMi4xOWgtMjUuNjV6IiBmaWxsPSIjNjI0YWZmIiAvPgo8L3N2Zz4K)](https://modelscope.cn/studios?name=Data-Jiucer&page=1&sort=latest&type=1) [![HuggingFace- Demos](https://img.shields.io/badge/🤗HuggingFace-Demos-4e29ff.svg)](https://huggingface.co/spaces?&search=datajuicer) -[![Document_List](https://img.shields.io/badge/Docs-English-blue?logo=Markdown)](#documents) -[![文档列表](https://img.shields.io/badge/文档-中文-blue?logo=Markdown)](README_ZH.md#documents) -[![OpZoo](https://img.shields.io/badge/Doc-OperatorZoo-blue?logo=Markdown)](https://modelscope.github.io/data-juicer/) -[![算子池](https://img.shields.io/badge/文档-算子池-blue?logo=Markdown)](https://modelscope.github.io/data-juicer/) +[![Document_List](https://img.shields.io/badge/Doc-DJ_Cookbook-blue?logo=Markdown)](#dj-cookbook) +[![文档列表](https://img.shields.io/badge/文档-DJ指南-blue?logo=Markdown)](README_ZH.md#dj-cookbook) 
+[![OpZoo](https://img.shields.io/badge/Doc-OperatorZoo-blue?logo=Markdown)](docs/Operators.md) +[![算子池](https://img.shields.io/badge/文档-算子池-blue?logo=Markdown)](docs/Operators_ZH.md) [![Paper](http://img.shields.io/badge/cs.LG-1.0Paper(SIGMOD'24)-B31B1B?logo=arxiv&logoColor=red)](https://arxiv.org/abs/2309.02033) [![Paper](http://img.shields.io/badge/cs.AI-2.0Paper-B31B1B?logo=arxiv&logoColor=red)](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/data_juicer/DJ2.0_arXiv_preview.pdf) @@ -34,7 +34,7 @@ We provide a [playground](http://8.138.149.181/) with a managed JupyterLab. [Try [Platform for AI of Alibaba Cloud (PAI)](https://www.aliyun.com/product/bigdata/learn) has cited our work and integrated Data-Juicer into its data processing products. PAI is an AI Native large model and AIGC engineering platform that provides dataset management, computing power management, model tool chain, model development, model training, model deployment, and AI asset management. For documentation on data processing, please refer to: [PAI-Data Processing for Large Models](https://help.aliyun.com/zh/pai/user-guide/components-related-to-data-processing-for-foundation-models/?spm=a2c4g.11186623.0.0.3e9821a69kWdvX). Data-Juicer is being actively updated and maintained. We will periodically enhance and add more features, data recipes and datasets. We welcome you to join us (via issues, PRs, [Slack](https://join.slack.com/t/data-juicer/shared_invite/zt-23zxltg9d-Z4d3EJuhZbCLGwtnLWWUDg?spm=a2c22.12281976.0.0.7a8253f30mgpjw) channel, [DingDing](https://qr.dingtalk.com/action/joingroup?code=v1,k1,YFIXM2leDEk7gJP5aMC95AfYT+Oo/EP/ihnaIEhMyJM=&_dt_no_comment=1&origin=11) group, ...), in promoting data-model co-development along with research and applications of foundation models! 
----- + ## News - ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-01-11] We release our 2.0 paper, [Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for Foundation Models](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/data_juicer/DJ2.0_arXiv_preview.pdf). It now can process 70B data samples within 2.1h, using 6400 CPU cores on 50 Ray nodes from Alibaba Cloud cluster, and deduplicate 5TB data within 2.8h using 1280 CPU cores on 8 Ray nodes. @@ -71,12 +71,11 @@ Besides, our paper is also updated to [v3](https://arxiv.org/abs/2309.02033). Table of Contents ================= -- [Data-Juicer is being actively updated and maintained. We will periodically enhance and add more features, data recipes and datasets. We welcome you to join us (via issues, PRs, Slack channel, DingDing group, ...), in promoting data-model co-development along with research and applications of foundation models!](#data-juicer-is-being-actively-updated-and-maintained-we-will-periodically-enhance-and-add-more-features-data-recipes-and-datasets--we-welcome-you-to-join-us-via-issues-prs-slack--channel-dingding-group--in-promoting-data-model-co-development-along-with-research-and-applications-of-foundation-models) - [News](#news) - [Why Data-Juicer?](#why-data-juicer) - [DJ-Cookbook](#dj-cookbook) - [Curated Resources](#curated-resources) - - [Coding with DJ](#coding-with-dj) + - [Coding with Data-Juicer (DJ)](#coding-with-data-juicer-dj) - [Use Cases \& Data Recipes](#use-cases--data-recipes) - [Interactive Examples](#interactive-examples) - [Installation](#installation) @@ -113,7 +112,7 @@ Table of Contents - **User-Friendly & Extensible**: Designed for simplicity and flexibility, with easy-start [guides](#quick-start), and [DJ-Cookbook](#dj-cookbook) containing fruitful demo usages. Feel free to [implement your own OPs](docs/DeveloperGuide.md#build-your-own-ops) for customizable data processing. 
-- **Efficient & Robust**: Providing performance-optimized [parallel data processing](docs/distributed) (Aliyun-PAI\Ray\CUDA\OP Fusion), +- **Efficient & Robust**: Providing performance-optimized [parallel data processing](docs) (Aliyun-PAI\Ray\CUDA\OP Fusion), faster with less resource usage, verified in large-scale production environments. @@ -128,7 +127,7 @@ Table of Contents - [Awesome LLM-Data](docs/awesome_llm_data.md) - ["Bad" Data Exhibition](docs/BadDataExhibition.md) -### Coding with DJ +### Coding with Data-Juicer (DJ) - [Overview of DJ](README.md) - [Operator Zoo](docs/Operators.md) - [Quick Start](docs/QuickStart.md) @@ -329,7 +328,7 @@ python tools/process_data.py --config ./demos/process_video_on_ray/configs/demo. - To run data processing across multiple machines, it is necessary to ensure that all distributed nodes can access the corresponding data paths (for example, by mounting the respective data paths on a file-sharing system such as NAS). - The deduplication operators for RAY mode are different from the single-machine version, and all those operators are prefixed with `ray`, e.g. `ray_video_deduplicator` and `ray_document_deduplicator`. -- More details can be found in the doc for [distributed processing](./docs/distributed_processing.md). +- More details can be found in the doc for [distributed processing](docs/Distributed.md). > Users can also opt not to use RAY and instead split the dataset to run on a cluster with [Slurm](https://slurm.schedmd.com/). In this case, please use the default Data-Juicer without RAY. > [Aliyun PAI-DLC](https://www.aliyun.com/activity/bigdata/pai-dlc) supports the RAY framework, Slurm framework, etc. Users can directly create RAY jobs and Slurm jobs on the DLC cluster. 
From 7b729db28ff08e8129a73a1fa88c72e51f3f2c62 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E9=81=93=E8=BE=95?= Date: Sun, 12 Jan 2025 15:30:48 +0800 Subject: [PATCH 3/4] add ZH version and minor fix --- README.md | 10 +- README_ZH.md | 208 ++++++++++++++++++++++------------------- docs/DeveloperGuide.md | 13 ++- 3 files changed, 121 insertions(+), 110 deletions(-) diff --git a/README.md b/README.md index f32f08fac..a2d437a9c 100644 --- a/README.md +++ b/README.md @@ -105,14 +105,14 @@ Table of Contents ![Overview](https://img.alicdn.com/imgextra/i4/O1CN01uawwRu1JMSdafy5lF_!!6000000001014-2-tps-4034-4146.png) - **Systematic & Reusable**: - Empowering users with a systematic library of 100+ core [OPs](docs/Operators.md), and 50+ reusable [config recipes](configs) and - dedicated [toolkits](#documentation), designed to + Empowering users with a systematic library of 100+ core [OPs](docs/Operators.md), and 50+ reusable config recipes and + dedicated toolkits, designed to function independently of specific multimodal LLM datasets and processing pipelines. Supporting data analysis, cleaning, and synthesis in pre-training, post-tuning, en, zh, and more scenarios. - **User-Friendly & Extensible**: Designed for simplicity and flexibility, with easy-start [guides](#quick-start), and [DJ-Cookbook](#dj-cookbook) containing fruitful demo usages. Feel free to [implement your own OPs](docs/DeveloperGuide.md#build-your-own-ops) for customizable data processing. -- **Efficient & Robust**: Providing performance-optimized [parallel data processing](docs) (Aliyun-PAI\Ray\CUDA\OP Fusion), +- **Efficient & Robust**: Providing performance-optimized [parallel data processing](docs/Distributed.md) (Aliyun-PAI\Ray\CUDA\OP Fusion), faster with less resource usage, verified in large-scale production environments. 
@@ -130,7 +130,7 @@ Table of Contents ### Coding with Data-Juicer (DJ) - [Overview of DJ](README.md) - [Operator Zoo](docs/Operators.md) -- [Quick Start](docs/QuickStart.md) +- [Quick Start](#quick-start) - [Configuration](configs/README.md) - [Developer Guide](docs/DeveloperGuide.md) - [API references](https://modelscope.github.io/data-juicer/) @@ -140,7 +140,7 @@ Table of Contents - [Sandbox](docs/Sandbox.md) - [Quality Classifier](tools/quality_classifier/README.md) - [Auto Evaluation](tools/evaluator/README.md) -- [Third-parties Integration](thirdparty/README.md) +- [Third-parties Integration](thirdparty/LLM_ecosystems/README.md) ### Use Cases & Data Recipes - [Recipes for data process in BLOOM](configs/reproduced_bloom/README.md) diff --git a/README_ZH.md b/README_ZH.md index 27bcb72f2..22799262d 100644 --- a/README_ZH.md +++ b/README_ZH.md @@ -1,47 +1,53 @@ -[[English Page]](README.md) | [[文档索引]](#documents) | [[API]](https://modelscope.github.io/data-juicer) | [[DJ-SORA]](docs/DJ_SORA_ZH.md) | [[Awesome List]](docs/awesome_llm_data.md) +[[英文主页]](README.md) | [[DJ-Cookbook]](#dj-cookbook) | [[算子池]](docs/Operators_ZH.md) | [[API]](https://modelscope.github.io/data-juicer) | [[Awesome LLM Data]](docs/awesome_llm_data.md) -# Data-Juicer: 为大模型提供更高质量、更丰富、更易“消化”的数据 +# Data Processing for and with Foundation Models - Data-Juicer + Data-Juicer ![](https://img.shields.io/badge/language-Python-214870.svg) ![](https://img.shields.io/badge/license-Apache--2.0-000000.svg) [![pypi version](https://img.shields.io/pypi/v/py-data-juicer?logo=pypi&color=026cad)](https://pypi.org/project/py-data-juicer) [![Docker version](https://img.shields.io/docker/v/datajuicer/data-juicer?logo=docker&label=Docker&color=498bdf)](https://hub.docker.com/r/datajuicer/data-juicer) -[![DataModality](https://img.shields.io/badge/DataModality-Text,Image,Audio,Video-brightgreen.svg)](docs/DeveloperGuide_ZH.md) 
-[![Usage](https://img.shields.io/badge/Usage-Cleaning,Generation,Analysis-FFD21E.svg)](docs/DeveloperGuide_ZH.md) +[![DataModality](https://img.shields.io/badge/DataModality-Text,Image,Audio,Video-brightgreen.svg)](#dj-cookbook) +[![Usage](https://img.shields.io/badge/Usage-Cleaning,Synthesis,Analysis-FFD21E.svg)](#dj-cookbook) [![ModelScope- Demos](https://img.shields.io/badge/ModelScope-Demos-4e29ff.svg?logo=data:image/svg+xml;base64,PHN2ZyB2aWV3Qm94PSIwIDAgMjI0IDEyMS4zMyIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj4KCTxwYXRoIGQ9Im0wIDQ3Ljg0aDI1LjY1djI1LjY1aC0yNS42NXoiIGZpbGw9IiM2MjRhZmYiIC8+Cgk8cGF0aCBkPSJtOTkuMTQgNzMuNDloMjUuNjV2MjUuNjVoLTI1LjY1eiIgZmlsbD0iIzYyNGFmZiIgLz4KCTxwYXRoIGQ9Im0xNzYuMDkgOTkuMTRoLTI1LjY1djIyLjE5aDQ3Ljg0di00Ny44NGgtMjIuMTl6IiBmaWxsPSIjNjI0YWZmIiAvPgoJPHBhdGggZD0ibTEyNC43OSA0Ny44NGgyNS42NXYyNS42NWgtMjUuNjV6IiBmaWxsPSIjMzZjZmQxIiAvPgoJPHBhdGggZD0ibTAgMjIuMTloMjUuNjV2MjUuNjVoLTI1LjY1eiIgZmlsbD0iIzM2Y2ZkMSIgLz4KCTxwYXRoIGQ9Im0xOTguMjggNDcuODRoMjUuNjV2MjUuNjVoLTI1LjY1eiIgZmlsbD0iIzYyNGFmZiIgLz4KCTxwYXRoIGQ9Im0xOTguMjggMjIuMTloMjUuNjV2MjUuNjVoLTI1LjY1eiIgZmlsbD0iIzM2Y2ZkMSIgLz4KCTxwYXRoIGQ9Im0xNTAuNDQgMHYyMi4xOWgyNS42NXYyNS42NWgyMi4xOXYtNDcuODR6IiBmaWxsPSIjNjI0YWZmIiAvPgoJPHBhdGggZD0ibTczLjQ5IDQ3Ljg0aDI1LjY1djI1LjY1aC0yNS42NXoiIGZpbGw9IiMzNmNmZDEiIC8+Cgk8cGF0aCBkPSJtNDcuODQgMjIuMTloMjUuNjV2LTIyLjE5aC00Ny44NHY0Ny44NGgyMi4xOXoiIGZpbGw9IiM2MjRhZmYiIC8+Cgk8cGF0aCBkPSJtNDcuODQgNzMuNDloLTIyLjE5djQ3Ljg0aDQ3Ljg0di0yMi4xOWgtMjUuNjV6IiBmaWxsPSIjNjI0YWZmIiAvPgo8L3N2Zz4K)](https://modelscope.cn/studios?name=Data-Jiucer&page=1&sort=latest&type=1) [![HuggingFace- Demos](https://img.shields.io/badge/🤗HuggingFace-Demos-4e29ff.svg)](https://huggingface.co/spaces?&search=datajuicer) -[![Document_List](https://img.shields.io/badge/Docs-English-blue?logo=Markdown)](README.md#documents) -[![文档列表](https://img.shields.io/badge/文档-中文-blue?logo=Markdown)](#documents) -[![API 
Reference](https://img.shields.io/badge/Docs-API_Reference-blue?logo=Markdown)](https://modelscope.github.io/data-juicer/) -[![Paper](http://img.shields.io/badge/cs.LG-arXiv%3A2309.02033-B31B1B?logo=arxiv&logoColor=red)](https://arxiv.org/abs/2309.02033) +[![Document_List](https://img.shields.io/badge/Docs-English-blue?logo=Markdown)](#documents) +[![文档列表](https://img.shields.io/badge/文档-中文-blue?logo=Markdown)](README_ZH.md#documents) +[![OpZoo](https://img.shields.io/badge/Doc-OperatorZoo-blue?logo=Markdown)](https://modelscope.github.io/data-juicer/) +[![算子池](https://img.shields.io/badge/文档-算子池-blue?logo=Markdown)](https://modelscope.github.io/data-juicer/) +[![Paper](http://img.shields.io/badge/cs.LG-1.0Paper(SIGMOD'24)-B31B1B?logo=arxiv&logoColor=red)](https://arxiv.org/abs/2309.02033) +[![Paper](http://img.shields.io/badge/cs.AI-2.0Paper-B31B1B?logo=arxiv&logoColor=red)](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/data_juicer/DJ2.0_arXiv_preview.pdf) -Data-Juicer 是一个一站式**多模态**数据处理系统,旨在为大语言模型 (LLM) 提供更高质量、更丰富、更易“消化”的数据。 +Data-Juicer 是一个一站式系统,面向大模型的文本及多模态数据处理。我们提供了一个基于 JupyterLab 的 [Playground](http://8.138.149.181/),您可以从浏览器中在线试用 Data-Juicer。 如果Data-Juicer对您的研发有帮助,请引用我们的[工作](#参考文献) 。 -我们提供了一个基于 JupyterLab 的 [Playground](http://8.138.149.181/),您可以从浏览器中在线试用 Data-Juicer。 如果Data-Juicer对您的研发有帮助,请引用我们的[工作](#参考文献) 。 +[阿里云人工智能平台 PAI](https://www.aliyun.com/product/bigdata/learn) 已引用Data-Juicer并将其能力集成到PAI的数据处理产品中。PAI提供包含数据集管理、算力管理、模型工具链、模型开发、模型训练、模型部署、AI资产管理在内的功能模块,为用户提供高性能、高稳定、企业级的大模型工程化能力。数据处理的使用文档请参考:[PAI-大模型数据处理](https://help.aliyun.com/zh/pai/user-guide/components-related-to-data-processing-for-foundation-models/?spm=a2c4g.11186623.0.0.3e9821a69kWdvX)。 -[阿里云人工智能平台 PAI](https://www.aliyun.com/product/bigdata/learn) 
已引用我们的工作,将Data-Juicer的能力集成到PAI的数据处理产品中。PAI提供包含数据集管理、算力管理、模型工具链、模型开发、模型训练、模型部署、AI资产管理在内的功能模块,为用户提供高性能、高稳定、企业级的大模型工程化能力。数据处理的使用文档请参考:[PAI-大模型数据处理](https://help.aliyun.com/zh/pai/user-guide/components-related-to-data-processing-for-foundation-models/?spm=a2c4g.11186623.0.0.3e9821a69kWdvX)。 - -Data-Juicer正在积极更新和维护中,我们将定期强化和新增更多的功能和数据菜谱。热烈欢迎您加入我们(issues/PRs/[Slack频道](https://join.slack.com/t/data-juicer/shared_invite/zt-23zxltg9d-Z4d3EJuhZbCLGwtnLWWUDg?spm=a2c22.12281976.0.0.7a8275bc8g7ypp) /[钉钉群](https://qr.dingtalk.com/action/joingroup?code=v1,k1,YFIXM2leDEk7gJP5aMC95AfYT+Oo/EP/ihnaIEhMyJM=&_dt_no_comment=1&origin=11)/...),一起推进LLM-数据的协同开发和研究! +Data-Juicer正在积极更新和维护中,我们将定期强化和新增更多的功能和数据菜谱。热烈欢迎您加入我们(issues/PRs/[Slack频道](https://join.slack.com/t/data-juicer/shared_invite/zt-23zxltg9d-Z4d3EJuhZbCLGwtnLWWUDg?spm=a2c22.12281976.0.0.7a8275bc8g7ypp) /[钉钉群](https://qr.dingtalk.com/action/joingroup?code=v1,k1,YFIXM2leDEk7gJP5aMC95AfYT+Oo/EP/ihnaIEhMyJM=&_dt_no_comment=1&origin=11)/...),一起推进大模型的数据-模型协同开发和研究应用! 
---- ## 新消息 +- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-01-11] 我们发布了 2.0 版论文 [Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for Foundation Models](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/data_juicer/DJ2.0_arXiv_preview.pdf)。DJ现在可以使用阿里云集群中 50 个 Ray 节点上的 6400 个 CPU 核心在 2.1 小时内处理 70B 数据样本,并使用 8 个 Ray 节点上的 1280 个 CPU 核心在 2.8 小时内对 5TB 数据进行重复数据删除。 +- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2025-01-03] 我们通过 20 多个相关的新 [OP](https://github.com/modelscope/data-juicer/releases/tag/v1.0.2) 以及与 LLaMA-Factory 和 ModelScope-Swift 兼容的统一 [数据集格式](https://github.com/modelscope/data-juicer/releases/tag/v1.0.3) 更好地支持Post-Tuning场景。 +- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-12-17] 我们提出了 *HumanVBench*,它包含 17 个以人为中心的任务,使用合成数据,从内在情感和外在表现的角度对视频 MLLM 的能力进行基准测试。请参阅我们的 [论文](https://arxiv.org/abs/2412.17574) 中的更多详细信息,并尝试使用它 [评估](https://github.com/modelscope/data-juicer/tree/HumanVBench) 您的模型。 +- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-11-22] 我们发布 DJ [v1.0.0](https://github.com/modelscope/data-juicer/releases/tag/v1.0.0),其中我们重构了 Data-Juicer 的 *Operator*、*Dataset*、*Sandbox* 和许多其他模块以提高可用性,例如支持容错、FastAPI 和自适应资源管理。 +- [2024-08-25] 我们在 KDD'2024 中提供了有关多模态 LLM 数据处理的[教程](https://modelscope.github.io/data-juicer/_static/tutorial_kdd24.html)。 +
+ History News: +> + - ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-08-09] 我们提出了Img-Diff,它通过*对比数据合成*来增强多模态大型语言模型的性能,在[MMVP benchmark](https://tsb0601.github.io/mmvp_blog/)中比GPT-4V高出12个点。 更多细节请参阅我们的 [论文](https://arxiv.org/abs/2408.04594), 以及从 [huggingface](https://huggingface.co/datasets/datajuicer/Img-Diff) 和 [modelscope](https://modelscope.cn/datasets/Data-Juicer/Img-Diff)下载这份数据集。 - ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-07-24] “天池 Better Synth 多模态大模型数据合成赛”——第四届Data-Juicer大模型数据挑战赛已经正式启动!立即访问[竞赛官网](https://tianchi.aliyun.com/competition/entrance/532251),了解赛事详情。 - ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png)[2024-07-17] 我们利用Data-Juicer[沙盒实验室套件](https://github.com/modelscope/data-juicer/blob/main/docs/Sandbox-ZH.md),通过数据与模型间的系统性研发工作流,调优数据和模型,在[VBench](https://huggingface.co/spaces/Vchitect/VBench_Leaderboard)文生视频排行榜取得了新的榜首。相关成果已经整理发表在[论文](http://arxiv.org/abs/2407.11784)中,并且模型已在[ModelScope](https://modelscope.cn/models/Data-Juicer/Data-Juicer-T2V)和[HuggingFace](https://huggingface.co/datajuicer/Data-Juicer-T2V)平台发布。 - ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png)[2024-07-12] 我们的MLLM-Data精选列表已经演化为一个模型-数据协同开发的角度系统性[综述](https://arxiv.org/abs/2407.08583)。欢迎[浏览](docs/awesome_llm_data.md)或参与贡献! - ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-06-01] ModelScope-Sora“数据导演”创意竞速——第三届Data-Juicer大模型数据挑战赛已经正式启动!立即访问[竞赛官网](https://tianchi.aliyun.com/competition/entrance/532219),了解赛事详情。 -
- History News: -> - - [2024-03-07] 我们现在发布了 **Data-Juicer [v0.2.0](https://github.com/alibaba/data-juicer/releases/tag/v0.2.0)**! 在这个新版本中,我们支持了更多的 **多模态数据(包括视频)** 相关特性。我们还启动了 **[DJ-SORA](docs/DJ_SORA_ZH.md)** ,为SORA-like大模型构建开放的大规模高质量数据集! - [2024-02-20] 我们在积极维护一份关于LLM-Data的*精选列表*,欢迎[访问](docs/awesome_llm_data.md)并参与贡献! - [2024-02-05] 我们的论文被SIGMOD'24 industrial track接收! @@ -58,72 +64,83 @@ Data-Juicer正在积极更新和维护中,我们将定期强化和新增更多 目录 === -- [Data-Juicer: 为大语言模型提供更高质量、更丰富、更易“消化”的数据](#data-juicer-为大语言模型提供更高质量更丰富更易消化的数据) - - [新消息](#新消息) -- [目录](#目录) - - [特点](#特点) - - [文档索引 ](#文档索引-) - - [演示样例](#演示样例) +- [新消息](#新消息) +- [为什么选择 Data-Juicer?](#为什么选择-data-juicer) +- [DJ-Cookbook](#dj-cookbook) + - [资源合集](#资源合集) + - [编写Data-Juicer (DJ) 代码](#编写data-juicer-dj-代码) + - [用例与数据菜谱](#用例与数据菜谱) + - [交互类示例](#交互类示例) +- [安装](#安装) - [前置条件](#前置条件) - - [安装](#安装) - - [从源码安装](#从源码安装) - - [使用 pip 安装](#使用-pip-安装) - - [使用 Docker 安装](#使用-docker-安装) - - [安装校验](#安装校验) - - [快速上手](#快速上手) - - [数据处理](#数据处理) - - [分布式数据处理](#分布式数据处理) - - [数据分析](#数据分析) - - [数据可视化](#数据可视化) - - [构建配置文件](#构建配置文件) - - [沙盒实验室](#沙盒实验室) - - [预处理原始数据(可选)](#预处理原始数据可选) - - [对于 Docker 用户](#对于-docker-用户) - - [数据处理菜谱](#数据处理菜谱) - - [开源协议](#开源协议) - - [贡献](#贡献) - - [致谢](#致谢) - - [参考文献](#参考文献) - - -## 特点 - -![Overview](https://img.alicdn.com/imgextra/i4/O1CN01WYQP3Z1JHsaXaQDK6_!!6000000001004-0-tps-3640-1812.jpg) - -* **系统化 & 可复用**:为用户提供系统化且可复用的80+核心[算子](docs/Operators_ZH.md),20+[配置菜谱](configs/README_ZH.md)和20+专用[工具池](#documentation),旨在让多模态数据处理独立于特定的大语言模型数据集和处理流水线。 - -* **数据反馈回路 & 沙盒实验室**:支持一站式数据-模型协同开发,通过[沙盒实验室](docs/Sandbox-ZH.md)快速迭代,基于数据和模型反馈回路、可视化和多维度自动评估等功能,使您更了解和改进您的数据和模型。 ![Data-in-the-loop](https://img.alicdn.com/imgextra/i2/O1CN017U7Zz31Y7XtCJ5GOz_!!6000000003012-0-tps-3640-1567.jpg) - -* **面向生产环境**:提供高效并行化的数据处理流水线(Aliyun-PAI\Ray\Slurm\CUDA\算子融合),减少内存占用和CPU开销,支持自动化处理容错。 ![sys-perf](https://img.alicdn.com/imgextra/i4/O1CN01Sk0q2U1hdRxbnQXFg_!!6000000004300-0-tps-2438-709.jpg) - -* 
**全面的数据处理菜谱**:为pre-training、fine-tuning、中英文等场景提供数十种[预构建的数据处理菜谱](configs/data_juicer_recipes/README_ZH.md)。 在LLaMA、LLaVA等模型上有效验证。 ![exp_llama](https://img.alicdn.com/imgextra/i2/O1CN019WtUPP1uhebnDlPR8_!!6000000006069-2-tps-2530-1005.png) - -* **用户友好**:设计简单易用,提供全面的[文档](#documents)、简易[入门指南](#快速上手)和[演示配置](configs/README_ZH.md),并且可以轻松地添加/删除[现有配置](configs/config_all.yaml)中的算子。 - -* **灵活 & 易扩展**:支持大多数数据格式(如jsonl、parquet、csv等),并允许灵活组合算子。支持[自定义算子](docs/DeveloperGuide_ZH.md#构建自己的算子),以执行定制化的数据处理。 - - -## 文档索引 - -* [概览](README_ZH.md) -* [算子库](docs/Operators_ZH.md) -* [配置系统](configs/README_ZH.md) -* [开发者指南](docs/DeveloperGuide_ZH.md) -* [API 参考](https://modelscope.github.io/data-juicer/) -* [KDD'24 相关教程](https://modelscope.github.io/data-juicer/_static/tutorial_kdd24.html) -* [“坏”数据展览](docs/BadDataExhibition_ZH.md) -* [Awesome LLM-Data](docs/awesome_llm_data.md) -* 专用工具箱 - * [质量分类器](tools/quality_classifier/README_ZH.md) - * [自动评测](tools/evaluator/README_ZH.md) - * [前处理](tools/preprocess/README_ZH.md) - * [后处理](tools/postprocess/README_ZH.md) -* [DJ-SORA](docs/DJ_SORA_ZH.md) -* [第三方库(大语言模型生态)](thirdparty/README_ZH.md) - + - [从源码安装](#从源码安装) + - [使用 pip 安装](#使用-pip-安装) + - [使用 Docker 安装](#使用-docker-安装) + - [安装校验](#安装校验) + - [使用视频相关算子](#使用视频相关算子) +- [快速上手](#快速上手) + - [数据处理](#数据处理) + - [分布式数据处理](#分布式数据处理) + - [数据分析](#数据分析) + - [数据可视化](#数据可视化) + - [构建配置文件](#构建配置文件) + - [沙盒实验室](#沙盒实验室) + - [预处理原始数据(可选)](#预处理原始数据可选) + - [对于 Docker 用户](#对于-docker-用户) +- [开源协议](#开源协议) +- [贡献](#贡献) +- [致谢](#致谢) +- [参考文献](#参考文献) + + +## 为什么选择 Data-Juicer? 
+ +![概述](https://img.alicdn.com/imgextra/i4/O1CN01uawwRu1JMSdafy5lF_!!6000000001014-2-tps-4034-4146.png) + +- **系统化和可重用**: +系统化地为用户提供 100 多个核心 [算子](docs/Operators.md) 和 50 多个可重用的数据菜谱和 +专用工具套件,旨在解耦于特定的多模态 LLM 数据集和处理管道运行。支持预训练、后训练、英语、中文等场景中的数据分析、清洗和合成。 + +- **易用、可扩展**: +简洁灵活,提供快速[入门指南](#快速上手)和包含丰富使用示例的[DJ-Cookbook](#dj-cookbook)。您可以灵活实现自己的OP,[自定义](docs/DeveloperGuide_ZH.md)数据处理工作流。 + +- **高效、稳定**:提供性能优化的[并行数据处理能力](docs/Distributed_ZH.md)(Aliyun-PAI\Ray\CUDA\OP Fusion), +更快、更少资源消耗,基于大规模生产环境打磨。 + +- **效果验证、沙盒**:支持数据模型协同开发,通过[沙盒实验室](docs/Sandbox-ZH.md)实现快速迭代,提供反馈循环、可视化等功能,让您更好地理解和改进数据和模型。已经有许多基于 DJ 衍生的数据菜谱和模型经过了效用验证,譬如在预训练、文生视频、图文生成等场景。 +![Data-in-the-loop](https://img.alicdn.com/imgextra/i2/O1CN017U7Zz31Y7XtCJ5GOz_!!6000000003012-0-tps-3640-1567.jpg) + +## DJ-Cookbook +### 资源合集 +- [KDD'24 相关教程](https://modelscope.github.io/data-juicer/_static/tutorial_kdd24.html) +- [Awesome LLM-Data](docs/awesome_llm_data.md) +- [“坏”数据展览](docs/BadDataExhibition_ZH.md) + +### 编写Data-Juicer (DJ) 代码 +- [DJ概览](README_ZH.md) +- [算子库](docs/Operators_ZH.md) +- [快速上手](#快速上手) +- [配置](configs/README_ZH.md) +- [开发者指南](docs/DeveloperGuide_ZH.md) +- [API参考](https://modelscope.github.io/data-juicer/) +- [预处理工具](tools/preprocess/README_ZH.md) +- [后处理工具](tools/postprocess/README_ZH.md) +- [格式转换](tools/fmt_conversion/README_ZH.md) +- [沙盒](docs/Sandbox-ZH.md) +- [质量分类器](tools/quality_classifier/README_ZH.md) +- [自动评估](tools/evaluator/README_ZH.md) +- [第三方集成](thirdparty/LLM_ecosystems/README_ZH.md) + +### 用例与数据菜谱 -## 演示样例 +* [BLOOM 数据处理菜谱](configs/reproduced_bloom/README_ZH.md) +* [RedPajama 数据处理菜谱](configs/reproduced_redpajama/README_ZH.md) +* [预训练文本数据增强菜谱](configs/data_juicer_recipes/README_ZH.md) +* [Fine-tuning文本数据增强菜谱](configs/data_juicer_recipes/README_ZH.md#完善前后的alpaca-cot数据集) +* [预训练多模态数据增强菜谱](configs/data_juicer_recipes/README_ZH.md#before-and-after-refining-for-multimodal-dataset) +* [DJ-SORA](docs/DJ_SORA_ZH.md) +### 交互类示例 * Data-Juicer 介绍 
[[ModelScope](https://modelscope.cn/studios/Data-Juicer/overview_scan/summary)] [[HuggingFace](https://huggingface.co/spaces/datajuicer/overview_scan)] * 数据可视化: * 基础指标统计 [[ModelScope](https://modelscope.cn/studios/Data-Juicer/data_visulization_statistics/summary)] [[HuggingFace](https://huggingface.co/spaces/datajuicer/data_visualization_statistics)] @@ -142,12 +159,13 @@ Data-Juicer正在积极更新和维护中,我们将定期强化和新增更多 * 数据处理回路 [[ModelScope](https://modelscope.cn/studios/Data-Juicer/data_process_loop/summary)] [[HuggingFace](https://huggingface.co/spaces/datajuicer/data_process_loop)] -## 前置条件 +## 安装 + +### 前置条件 * 推荐 Python>=3.9,<=3.10 * gcc >= 5 (at least C++14 support) -## 安装 ### 从源码安装 @@ -266,7 +284,7 @@ export DATA_JUICER_MODELS_CACHE="/path/to/another/directory/models" export DATA_JUICER_ASSETS_CACHE="/path/to/another/directory/assets" ``` -#### 灵活的编程接口 +- **灵活的编程接口:** 我们提供了各种层次的简单编程接口,以供用户选择: ```python # ... init op & dataset ... @@ -295,7 +313,8 @@ python tools/process_data.py --config ./demos/process_video_on_ray/configs/demo. ``` - 如果需要在多机上使用RAY执行数据处理,需要确保所有节点都可以访问对应的数据路径,即将对应的数据路径挂载在共享文件系统(如NAS)中。 - - RAY 模式下的去重算子与单机版本不同,所有 RAY 模式下的去重算子名称都以 `ray` 作为前缀,例如 `ray_video_deduplicator` 和 `ray_document_deduplicator`。这些去重算子依赖于 [Redis](https://redis.io/) 实例.因此使用前除启动 RAY 集群外还需要启动 Redis 实例,并在对应的配置文件中填写 Redis 实例的 `host` 和 `port`。 + - RAY 模式下的去重算子与单机版本不同,所有 RAY 模式下的去重算子名称都以 `ray` 作为前缀,例如 `ray_video_deduplicator` 和 `ray_document_deduplicator`。 + - 更多细节请参考[分布式处理文档](docs/Distributed_ZH.md)。 > 用户也可以不使用 RAY,拆分数据集后使用 [Slurm](https://slurm.schedmd.com/) 在集群上运行,此时使用不包含 RAY 的原版 Data-Juicer 即可。 > [阿里云 PAI-DLC](https://www.aliyun.com/activity/bigdata/pai-dlc) 支持 RAY 框架、Slurm 框架等,用户可以直接在DLC集群上创建 RAY 作业 和 Slurm 作业。 @@ -417,14 +436,6 @@ docker exec -it bash

🔼 back to index

-## 数据处理菜谱 - -* [BLOOM 数据处理菜谱](configs/reproduced_bloom/README_ZH.md) -* [RedPajama 数据处理菜谱](configs/reproduced_redpajama/README_ZH.md) -* [预训练文本数据增强菜谱](configs/data_juicer_recipes/README_ZH.md) -* [Fine-tuning文本数据增强菜谱](configs/data_juicer_recipes/README_ZH.md#完善前后的alpaca-cot数据集) -* [预训练多模态数据增强菜谱](configs/data_juicer_recipes/README_ZH.md#before-and-after-refining-for-multimodal-dataset) - ## 开源协议 Data-Juicer 在 Apache License 2.0 协议下发布。 @@ -433,16 +444,13 @@ Data-Juicer 在 Apache License 2.0 协议下发布。 大模型是一个高速发展的领域,我们非常欢迎贡献新功能、修复漏洞以及文档改善。请参考[开发者指南](docs/DeveloperGuide_ZH.md)。 -如果您有任何问题,欢迎加入我们的[讨论群](README_ZH.md) 。 ## 致谢 -Data-Juicer 被各种 LLM产品和研究工作使用,包括来自阿里云-通义的行业大模型,例如点金 -(金融分析),智文(阅读助手),还有阿里云人工智能平台 (PAI)。 我们期待更多您的体验反馈、建议和合作共建! +Data-Juicer被许多大模型相关产品和研究工作所使用,例子阿里巴巴通义和阿里云人工智能平台 (PAI) 之上的工业界场景。 我们期待更多您的体验反馈、建议和合作共建! -Data-Juicer 感谢并参考了社区开源项目: -[Huggingface-Datasets](https://github.com/huggingface/datasets), [Bloom](https://huggingface.co/bigscience/bloom), [RedPajama](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1), [Pile](https://huggingface.co/datasets/EleutherAI/pile), [Alpaca-Cot](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT), [Megatron-LM](https://github.com/NVIDIA/Megatron-LM), [DeepSpeed](https://www.deepspeed.ai/), [Arrow](https://github.com/apache/arrow), [Ray](https://github.com/ray-project/ray), [Beam](https://github.com/apache/beam), [LM-Harness](https://github.com/EleutherAI/lm-evaluation-harness), [HELM](https://github.com/stanford-crfm/helm), .... +Data-Juicer 感谢社区[贡献者](https://github.com/modelscope/data-juicer/graphs/contributors) 和相关的先驱开源项目,譬如[Huggingface-Datasets](https://github.com/huggingface/datasets), [Bloom](https://huggingface.co/bigscience/bloom), [RedPajama](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1), [Arrow](https://github.com/apache/arrow), [Ray](https://github.com/ray-project/ray), .... 
## 参考文献 如果您发现我们的工作对您的研发有帮助,请引用以下[论文](https://arxiv.org/abs/2309.02033) 。 @@ -459,12 +467,16 @@ Data-Juicer 感谢并参考了社区开源项目: 更多Data-Juicer团队相关论文: > +- [Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for Foundation Models](https://dail-wlcb.oss-cn-wulanchabu.aliyuncs.com/data_juicer/DJ2.0_arXiv_preview.pdf) + - [Data-Juicer Sandbox: A Comprehensive Suite for Multimodal Data-Model Co-development](https://arxiv.org/abs/2407.11784) - [The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective](https://arxiv.org/abs/2407.08583) - [ImgDiff: Contrastive Data Synthesis for Vision Large Language Models](https://arxiv.org/abs/2408.04594) +- [HumanVBench: Exploring Human-Centric Video Understanding Capabilities of MLLMs with Synthetic Benchmark Data](https://arxiv.org/abs/2412.17574) + - [Data Mixing Made Efficient: A Bivariate Scaling Law for Language Model Pretraining](https://arxiv.org/abs/2405.14908)
diff --git a/docs/DeveloperGuide.md b/docs/DeveloperGuide.md index 734f1201a..2a6c746ea 100644 --- a/docs/DeveloperGuide.md +++ b/docs/DeveloperGuide.md @@ -1,12 +1,11 @@ # How-to Guide for Developers -- [How-to Guide for Developers](#how-to-guide-for-developers) - - [Coding Style](#coding-style) - - [Build your own OPs](#build-your-own-ops) - - [(Optional) Make your OP fusible](#optional-make-your-op-fusible) - - [Build your own configs](#build-your-own-configs) - - [Fruitful config sources \& Type hints](#fruitful-config-sources--type-hints) - - [Hierarchical configs and helps](#hierarchical-configs-and-helps) +- [Coding Style](#coding-style) +- [Build your own OPs](#build-your-own-ops) + - [(Optional) Make your OP fusible](#optional-make-your-op-fusible) +- [Build your own configs](#build-your-own-configs) + - [Fruitful config sources \& Type hints](#fruitful-config-sources--type-hints) + - [Hierarchical configs and helps](#hierarchical-configs-and-helps) ## Coding Style From 69d86f9acf11222a527b4ed8f7b34a2870a66008 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E9=81=93=E8=BE=95?= Date: Mon, 13 Jan 2025 11:14:21 +0800 Subject: [PATCH 4/4] fix bad link --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index a2d437a9c..e057e3652 100644 --- a/README.md +++ b/README.md @@ -144,7 +144,7 @@ Table of Contents ### Use Cases & Data Recipes - [Recipes for data process in BLOOM](configs/reproduced_bloom/README.md) -- [Recipes for data process in RedPajama](configs/redpajama/README.md) +- [Recipes for data process in RedPajama](configs/reproduced_redpajama/README.md) - [Refined recipes for pre-training text data](configs/data_juicer_recipes/README.md) - [Refined recipes for fine-tuning text data](configs/data_juicer_recipes/README.md#before-and-after-refining-for-alpaca-cot-dataset) - [Refined recipes for pre-training multi-modal data](configs/data_juicer_recipes/README.md#before-and-after-refining-for-multimodal-dataset)