Update docs according to recent refactor and events #366

Merged 4 commits on Jul 26, 2024 (changes shown from 3 commits)
59 changes: 56 additions & 3 deletions README.md
@@ -37,6 +37,7 @@ We welcome you to join us (via issues, PRs, [Slack](https://join.slack.com/t/dat
----

## News
- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-07-24] "Tianchi Better Synth Data Synthesis Competition for Multimodal Large Models" — Our 4th data-centric LLM competition has kicked off! Please visit the competition's [official website](https://tianchi.aliyun.com/competition/entrance/532251) for more information.
- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-07-17] We utilized the Data-Juicer [Sandbox Laboratory Suite](https://github.com/modelscope/data-juicer/blob/main/docs/Sandbox.md) to systematically optimize data and models through a co-development workflow between data and models, achieving a new top spot on the [VBench](https://huggingface.co/spaces/Vchitect/VBench_Leaderboard) text-to-video leaderboard. The related achievements have been compiled and published in a [paper](http://arxiv.org/abs/2407.11784), and the model has been released on the [ModelScope](https://modelscope.cn/models/Data-Juicer/Data-Juicer-T2V) and [HuggingFace](https://huggingface.co/datajuicer/Data-Juicer-T2V) platforms.
- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-07-12] Our *awesome list of MLLM-Data* has evolved into a systematic [survey](https://arxiv.org/abs/2407.08583) from a model-data co-development perspective. Welcome to [explore](docs/awesome_llm_data.md) and contribute!
- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-06-01] ModelScope-Sora "Data Directors" creative sprint—Our third data-centric LLM competition has kicked off! Please visit the competition's [official website](https://tianchi.aliyun.com/competition/entrance/532219) for more information.
@@ -96,8 +97,8 @@ Table of Contents
visualization, and multidimensional automatic evaluation, so that you can better understand and improve your data and models.
![Data-in-the-loop](https://img.alicdn.com/imgextra/i2/O1CN017U7Zz31Y7XtCJ5GOz_!!6000000003012-0-tps-3640-1567.jpg)

- **Enhanced Efficiency**: Providing efficient and parallel data processing pipelines (Aliyun-PAI\Ray\Slurm\CUDA\OP Fusion)
requiring less memory and CPU usage, optimized for maximum productivity.
- **Towards Production Environment**: Providing efficient and parallel data processing pipelines (Aliyun-PAI\Ray\Slurm\CUDA\OP Fusion)
requiring less memory and CPU usage, optimized with automatic fault tolerance.
![sys-perf](https://img.alicdn.com/imgextra/i4/O1CN01Sk0q2U1hdRxbnQXFg_!!6000000004300-0-tps-2438-709.jpg)

- **Comprehensive Data Processing Recipes**: Offering tens of [pre-built data
@@ -154,7 +155,7 @@ Table of Contents

## Installation

### From Source
### From Source

- Run the following commands to install the latest basic `data_juicer` version in
editable mode:
@@ -229,6 +230,15 @@ You can install FFmpeg using package managers(e.g. sudo apt install ffmpeg on De

Check if your environment path is set correctly by running the ffmpeg command from the terminal.
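
The PATH check described above can also be scripted. The following is a minimal sketch, not part of Data-Juicer: the helper names `find_ffmpeg` and `ffmpeg_version_line` are hypothetical, and it relies only on the Python standard library and the `ffmpeg -version` flag.

```python
import shutil
import subprocess
from typing import Optional


def find_ffmpeg() -> Optional[str]:
    """Return the absolute path of the ffmpeg executable found on PATH, or None."""
    return shutil.which("ffmpeg")


def ffmpeg_version_line(path: str) -> str:
    """Run `ffmpeg -version` and return the first line of its output (or '')."""
    result = subprocess.run([path, "-version"],
                            capture_output=True, text=True, check=True)
    lines = result.stdout.splitlines()
    return lines[0] if lines else ""


if __name__ == "__main__":
    ffmpeg_path = find_ffmpeg()
    if ffmpeg_path is None:
        print("ffmpeg was not found on PATH")
    else:
        print(ffmpeg_version_line(ffmpeg_path))
```

If the script reports that ffmpeg was not found, the directory containing the executable is probably missing from `PATH`.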


<br><hr>
<div style="text-align: right;">

[🔼 back to index](#documentation-index-a-namedocuments)

</div>


## Quick Start


@@ -259,6 +269,20 @@ export DATA_JUICER_MODELS_CACHE="/path/to/another/directory/models"
export DATA_JUICER_ASSETS_CACHE="/path/to/another/directory/assets"
```
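
For scripts that need to honor the same variables, one possible pattern is sketched below. It is an illustration only: `resolve_cache_dir` and the fallback directory are hypothetical, and how Data-Juicer itself reads these variables is not shown in this document.

```python
import os
from pathlib import Path


def resolve_cache_dir(env_var: str, fallback: Path) -> Path:
    """Return the directory named by `env_var`, or `fallback` if it is unset or empty."""
    value = os.environ.get(env_var, "").strip()
    return Path(value).expanduser() if value else fallback


# Hypothetical fallback location, chosen only for this example.
models_dir = resolve_cache_dir("DATA_JUICER_MODELS_CACHE",
                               Path.home() / ".cache" / "example_models")
print(models_dir)
```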

#### Flexible Programming Interface
We provide several simple interfaces for users to choose from:
```python
#... init op & dataset ...

# Chained-call style; supports a single operator or a list of operators
dataset = dataset.process(op)
dataset = dataset.process([op1, op2])
# Functional programming style for quick integration or script prototyping
dataset = op(dataset)
dataset = op.run(dataset)
```
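
To make the relationship between these call styles concrete, here is a self-contained toy sketch. It does not use Data-Juicer: `ToyDataset` and `ToyOp` are hypothetical stand-ins, and the assumption that `process`, `__call__`, and `run` all route to the same logic is made only for illustration.

```python
from typing import Callable, Dict, List, Sequence, Union

Sample = Dict[str, str]


class ToyOp:
    """Hypothetical operator that applies a per-sample function."""

    def __init__(self, fn: Callable[[Sample], Sample]):
        self.fn = fn

    def run(self, dataset: "ToyDataset") -> "ToyDataset":
        # Build a new dataset so the input is left untouched.
        return ToyDataset([self.fn(dict(s)) for s in dataset.samples])

    def __call__(self, dataset: "ToyDataset") -> "ToyDataset":
        # Functional style: op(dataset) is shorthand for op.run(dataset).
        return self.run(dataset)


class ToyDataset:
    """Hypothetical dataset wrapper around a list of samples."""

    def __init__(self, samples: List[Sample]):
        self.samples = samples

    def process(self, ops: Union[ToyOp, Sequence[ToyOp]]) -> "ToyDataset":
        # Chained style: accept one operator or a list, applied in order.
        if isinstance(ops, ToyOp):
            ops = [ops]
        dataset = self
        for op in ops:
            dataset = op.run(dataset)
        return dataset


strip = ToyOp(lambda s: {**s, "text": s["text"].strip()})
lower = ToyOp(lambda s: {**s, "text": s["text"].lower()})
data = ToyDataset([{"text": "  Hello "}, {"text": "WORLD"}])

chained = data.process([strip, lower])
functional = lower(strip(data))
print(chained.samples)  # [{'text': 'hello'}, {'text': 'world'}]
```

In this toy version both styles produce the same samples; the chained form reads better for pipelines, while the functional form is handy when an operator is used on its own.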


### Distributed Data Processing

We have now implemented multi-machine distributed data processing based on [RAY](https://www.ray.io/). The corresponding demos can be run using the following commands:
@@ -376,6 +400,14 @@ docker run -dit \ # run the container in the background
docker exec -it <container_id> bash
```


<br><hr>
<div style="text-align: right;">

[🔼 back to index](#documentation-index-a-namedocuments)

</div>

## Data Recipes
- [Recipes for data process in BLOOM](configs/reproduced_bloom/README.md)
- [Recipes for data process in RedPajama](configs/redpajama/README.md)
@@ -417,3 +449,24 @@ If you find our work useful for your research or development, please kindly cite
year={2024}
}
```

<details>
<summary>More related papers from the Data-Juicer Team:
</summary>

- [Data-Juicer Sandbox: A Comprehensive Suite for Multimodal Data-Model Co-development](https://arxiv.org/abs/2407.11784)

- [The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective](https://arxiv.org/abs/2407.08583)

- [Data Mixing Made Efficient: A Bivariate Scaling Law for Language Model Pretraining](https://arxiv.org/abs/2402.11505)

</details>



<br><hr>
<div style="text-align: right;">

[🔼 back to index](#documentation-index-a-namedocuments)

</div>
43 changes: 39 additions & 4 deletions README_ZH.md
@@ -31,6 +31,7 @@ Data-Juicer is being actively updated and maintained; we will periodically enhance and add more
----

## News
- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-07-24] "Tianchi Better Synth Multimodal Large Model Data Synthesis Competition": the 4th Data-Juicer large-model data challenge has officially kicked off! Visit the [competition website](https://tianchi.aliyun.com/competition/entrance/532251) for details.
- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-07-17] We used the Data-Juicer [Sandbox Laboratory Suite](https://github.com/modelscope/data-juicer/blob/main/docs/Sandbox-ZH.md) to tune data and models through a systematic data-model development workflow, taking a new top spot on the [VBench](https://huggingface.co/spaces/Vchitect/VBench_Leaderboard) text-to-video leaderboard. The results have been compiled and published in a [paper](http://arxiv.org/abs/2407.11784), and the model has been released on the [ModelScope](https://modelscope.cn/models/Data-Juicer/Data-Juicer-T2V) and [HuggingFace](https://huggingface.co/datajuicer/Data-Juicer-T2V) platforms.
- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-07-12] Our curated MLLM-Data list has evolved into a systematic [survey](https://arxiv.org/abs/2407.08583) from a model-data co-development perspective. Welcome to [explore](docs/awesome_llm_data.md) or contribute!
- ![new](https://img.alicdn.com/imgextra/i4/O1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) [2024-06-01] ModelScope-Sora "Data Directors" creative sprint: the 3rd Data-Juicer large-model data challenge has officially kicked off! Visit the [competition website](https://tianchi.aliyun.com/competition/entrance/532219) for details.
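@@ -82,7 +83,7 @@ Data-Juicer is being actively updated and maintained; we will periodically enhance and add more

* **Data-in-the-loop & Sandbox**: Supports one-stop data-model co-development with rapid iteration through the [sandbox laboratory](docs/Sandbox-ZH.md), offering data and model feedback loops, visualization, and multidimensional automatic evaluation, so that you can better understand and improve your data and models. ![Data-in-the-loop](https://img.alicdn.com/imgextra/i2/O1CN017U7Zz31Y7XtCJ5GOz_!!6000000003012-0-tps-3640-1567.jpg)

* **Enhanced Efficiency**: Provides efficient, parallel data processing pipelines (Aliyun-PAI\Ray\Slurm\CUDA\OP Fusion) that reduce memory and CPU usage and improve productivity. ![sys-perf](https://img.alicdn.com/imgextra/i4/O1CN01Sk0q2U1hdRxbnQXFg_!!6000000004300-0-tps-2438-709.jpg)
* **Towards Production Environment**: Provides efficient, parallel data processing pipelines (Aliyun-PAI\Ray\Slurm\CUDA\OP Fusion) that reduce memory and CPU usage and support automatic fault tolerance. ![sys-perf](https://img.alicdn.com/imgextra/i4/O1CN01Sk0q2U1hdRxbnQXFg_!!6000000004300-0-tps-2438-709.jpg)

* **Comprehensive Data Processing Recipes**: Offers dozens of [pre-built data processing recipes](configs/data_juicer_recipes/README_ZH.md) for pre-training, fine-tuning, English, Chinese, and other scenarios. Validated on models such as LLaMA and LLaVA. ![exp_llama](https://img.alicdn.com/imgextra/i2/O1CN019WtUPP1uhebnDlPR8_!!6000000006069-2-tps-2530-1005.png)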

@@ -235,6 +236,19 @@ export DATA_JUICER_MODELS_CACHE="/path/to/another/directory/models"
export DATA_JUICER_ASSETS_CACHE="/path/to/another/directory/assets"
```

#### Flexible Programming Interface
We provide simple programming interfaces at several levels for users to choose from:
```python
# ... init op & dataset ...

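# Chained-call style; supports a single operator or a list of operators
dataset = dataset.process(op)
dataset = dataset.process([op1, op2])
# Functional programming style, convenient for quick integration or script prototyping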
dataset = op(dataset)
dataset = op.run(dataset)
```

### Distributed Data Processing

Data-Juicer now implements multi-machine distributed data processing based on [RAY](https://www.ray.io/).
@@ -278,6 +292,9 @@ dj-analyze --config configs/demo/analyzer.yaml
streamlit run app.py
```




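### Build Up Config Files

* The config file contains a series of global parameters and a list of operators for data processing. You need to set:
@@ -380,8 +397,6 @@ Data-Juicer is used by various LLM products and research works, including Alibaba Cloud-通
Data-Juicer thanks and refers to the following community open-source projects: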
[Huggingface-Datasets](https://github.com/huggingface/datasets), [Bloom](https://huggingface.co/bigscience/bloom), [RedPajama](https://github.com/togethercomputer/RedPajama-Data/tree/rp_v1), [Pile](https://huggingface.co/datasets/EleutherAI/pile), [Alpaca-Cot](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT), [Megatron-LM](https://github.com/NVIDIA/Megatron-LM), [DeepSpeed](https://www.deepspeed.ai/), [Arrow](https://github.com/apache/arrow), [Ray](https://github.com/ray-project/ray), [Beam](https://github.com/apache/beam), [LM-Harness](https://github.com/EleutherAI/lm-evaluation-harness), [HELM](https://github.com/stanford-crfm/helm), ....



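## References
If you find our work useful for your research or development, please cite the following [paper](https://arxiv.org/abs/2309.02033).

@@ -392,4 +407,24 @@ Data-Juicer thanks and refers to the following community open-source projects: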
booktitle={International Conference on Management of Data},
year={2024}
}
```
```
<details>
<summary>More related papers from the Data-Juicer Team:
</summary>

- [Data-Juicer Sandbox: A Comprehensive Suite for Multimodal Data-Model Co-development](https://arxiv.org/abs/2407.11784)

- [The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective](https://arxiv.org/abs/2407.08583)

- [Data Mixing Made Efficient: A Bivariate Scaling Law for Language Model Pretraining](https://arxiv.org/abs/2402.11505)

</details>



<br><hr>
<div style="text-align: right;">

[🔼 back to index](#documentation-index-a-namedocuments)

</div>
19 changes: 12 additions & 7 deletions docs/DJ_SORA.md
@@ -38,7 +38,8 @@ This project is being actively updated and maintained. We eagerly invite you to
- [✅] Ray based multi-machine distributed running
- [✅] Aliyun PAI-DLC & Slurm based multi-machine distributed running
- [✅] Distributed scheduling optimization (OP-aware, automated load balancing) --> Aliyun PAI-DLC
- [ ] [WIP] Distributed storage optimization
- [WIP] Low precision acceleration support for video related operators. (git tags: dj_op, dj_efficiency)
- [WIP] SOTA model enhancement of existing video related operators. (git tags: dj_op, dj_sota_models)

## Basic Operators (video spatio-temporal dimension)
- Towards Data Quality
@@ -90,20 +91,24 @@ This project is being actively updated and maintained. We eagerly invite you to
- [✅] **Youku-mPLUG-CN**: 36TB video-caption data: `{<caption, video_id>}`
- [✅] **InternVid**: 234M data sample: `{<caption, youtube_id, start/end_time>}`
- [✅] **MSR-VTT**: 10K video-caption data: `{<caption, video_id>}`
- [ ] [WIP] ModelScope's datasets integration
- [ ] VideoInstruct-100K, Panda70M, ......
- [ ] ModelScope's datasets integration
- [ ] VideoInstruct-100K, Panda70M, ......
- [ ] Large-scale high-quality DJ-SORA dataset
- [✅] (Data sandbox) Building and optimizing multimodal data recipes with DJ-video operators (which are also being continuously extended and improved).
- [ ] [WIP] Continuous expansion of data sources: open-datasets, Youku, web, ...
- [ ] [WIP] Large-scale analysis, cleaning, and generation of high-quality multimodal datasets based on DJ recipes (OpenVideos, ...)
- [ ] [WIP] Large-scale generation of 3DPatch datasets based on DJ recipes.
- [ ] Continuous expansion of data sources: open-datasets, Youku, web, ...
- [ ] Large-scale analysis, cleaning, and generation of high-quality multimodal datasets based on DJ recipes (OpenVideos, ...)
- [WIP] broad scenarios, high-dynamic
- ...

## DJ-SORA Data Validation and Model Training
- [ ] [WIP] (DJ-Bench101) Exploring and refining the collaborative development of multimodal data and model, establishing benchmarks and insights.
- [ ] Exploring and refining the collaborative development of multimodal data and models, establishing benchmarks and insights. [paper](https://arxiv.org/abs/2407.11784)
- [ ] [WIP] Integration of SORA-like model training pipelines
- [EasyAnimate](https://github.com/aigc-apps/EasyAnimate)
- [✅] [T2V](https://t2v-turbo.github.io/)
- [✅] [V-Bench](https://vchitect.github.io/VBench-project/)
- ...
- [✅] (Model-Data sandbox) With relatively small models and the DJ-SORA dataset, exploring low-cost, transferable, and instructive data-model co-design, configurations and checkpoints.
- [ ] [WIP] Training SORA-like models with DJ-SORA data on larger scales and in more scenarios to improve model performance.
- [✅] Data-Juicer-T2V, [V-Bench Top1 model](https://huggingface.co/datajuicer/Data-Juicer-T2V)
- ...
- ...
24 changes: 14 additions & 10 deletions docs/DJ_SORA_ZH.md
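@@ -38,7 +38,8 @@ DJ-SORA will be based on Data-Juicer (including hundreds of dedicated video, image, audio,
- [✅] Ray based multi-machine distributed running
- [✅] Aliyun PAI-DLC & Slurm based multi-machine distributed running
- [✅] Distributed scheduling optimization (OP-aware, automated load balancing) --> Aliyun PAI-DLC
- [ ] [WIP] Distributed storage optimization
- [WIP] Low precision acceleration support for video related operators, git tags: dj_op, dj_efficiency
- [WIP] SOTA model enhancement of existing video related operators, git tags: dj_op, dj_sota_models

## Basic Operators (video spatio-temporal dimension)
- Towards Data Quality
@@ -94,22 +95,25 @@ DJ-SORA will be based on Data-Juicer (including hundreds of dedicated video, image, audio,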
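- [✅] **Youku-mPLUG-CN**: 36TB video-caption data: `{<caption, video_id>}`
- [✅] **InternVid**: 234M data sample: `{<caption, youtube_id, start/end_time>}`
- [✅] **MSR-VTT**: 10K video-caption data: `{<caption, video_id>}`
- [ ] [WIP] ModelScope's datasets integration
- [ ] VideoInstruct-100K, Panda70M, ......
- [ ] ModelScope's datasets integration
- [ ] VideoInstruct-100K, Panda70M, ......
- [ ] Large-scale high-quality DJ-SORA dataset
- [✅] (Data sandbox) Building and optimizing multimodal data recipes with DJ-video operators (which are being continuously improved in parallel)
- [ ] [WIP] Continuous expansion of data sources: open-datasets, Youku, web, ...
- [ ] [WIP] Large-scale analysis, cleaning, and generation of high-quality multimodal datasets based on DJ recipes (OpenVideo, ...)
- [ ] [WIP] Building a large-scale 3DPatch data warehouse based on DJ recipes
- [ ] Continuous expansion of data sources: open-datasets, Youku, web, ...
- [ ] Large-scale analysis, cleaning, and generation of high-quality multimodal datasets based on DJ recipes
- [WIP] Broad scenarios, high dynamics
- ...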

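## DJ-SORA Data Validation and Model Training
- [ ] [WIP] (DJ-Bench101) Exploring and refining the collaborative development of multimodal data and models, establishing benchmarks and insights
- [ ] [WIP] Integration of SORA-like model training pipelines
- [EasyAnimate](https://github.com/aigc-apps/EasyAnimate)
- [✅] Exploring and refining the collaborative development of multimodal data and models, establishing benchmarks and insights: [paper](https://arxiv.org/abs/2407.11784)
- [ ] [WIP] Integration of SORA-like model training pipelines
- [✅] [EasyAnimate](https://github.com/aigc-apps/EasyAnimate)
- [✅] [T2V](https://t2v-turbo.github.io/)
- [✅] [V-Bench](https://vchitect.github.io/VBench-project/)
- ...
- [✅] (Model-Data sandbox) With relatively small models and the DJ-SORA dataset, exploring low-cost, transferable, and instructive data-model co-design, configurations and checkpoints
- [ ] [WIP] Training SORA-like models with DJ-SORA data at larger scales and in more scenarios to improve model performance
- ...
- [✅] Data-Juicer-T2V, [V-Bench Top1 model](https://huggingface.co/datajuicer/Data-Juicer-T2V)
- ...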


8 changes: 5 additions & 3 deletions docs/DeveloperGuide.md
@@ -11,7 +11,7 @@
## Coding Style

We define our styles in `.pre-commit-config.yaml`. Before committing,
please install `pre-commit` tool to check and modify accordingly:
please install the `pre-commit` tool to automatically check and fix your code:

```shell
# ===========install pre-commit tool===========
@@ -104,20 +104,22 @@ class StatsKeys(object):
return False
```

- If Hugging Face models are used within an operator, you might want to leverage GPU acceleration. To achieve this, declare `self._accelerator = 'cuda'` in the constructor, and ensure that `compute_stats` and `process` methods accept an additional positional argument `rank`.
- If Hugging Face models are used within an operator, you might want to leverage GPU acceleration. To achieve this, declare `_accelerator = 'cuda'` as a class attribute of the operator, and ensure that the `compute_stats` and `process` methods accept an additional positional argument `rank`.
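
One reason to declare the accelerator at class level is that it can then be read without instantiating the operator. The sketch below illustrates that idea together with a possible use of `rank`. It is a toy built on assumptions: `ToyFilter`, `ToyGpuFilter`, and `pick_device` are hypothetical names, and the round-robin mapping from `rank` to a device index is only one plausible scheme, not necessarily what Data-Juicer does.

```python
from typing import Optional, Type


class ToyFilter:
    """Hypothetical base operator; defaults to CPU execution."""

    _accelerator = "cpu"  # class-level default, readable without an instance

    def compute_stats(self, sample: dict, rank: Optional[int] = None) -> dict:
        # `rank` identifies the worker process; it is unused on CPU.
        sample["text_len"] = len(sample.get("text", ""))
        return sample


class ToyGpuFilter(ToyFilter):
    """Hypothetical operator that opts in to GPU execution."""

    _accelerator = "cuda"


def pick_device(op_cls: Type[ToyFilter], rank: Optional[int], num_gpus: int) -> str:
    """Map a worker rank to a device string (assumed round-robin scheme)."""
    if getattr(op_cls, "_accelerator", "cpu") != "cuda" or num_gpus <= 0:
        return "cpu"
    index = 0 if rank is None else rank % num_gpus
    return f"cuda:{index}"


print(pick_device(ToyGpuFilter, rank=5, num_gpus=4))  # cuda:1
print(pick_device(ToyFilter, rank=5, num_gpus=4))     # cpu
```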

```python
# ... (same as above)

@OPERATORS.register_module('text_length_filter')
class TextLengthFilter(Filter):

    _accelerator = 'cuda'

    def __init__(self,
                 min_len: PositiveInt = 10,
                 max_len: PositiveInt = sys.maxsize,
                 *args,
                 **kwargs):
        # ... (same as above)
        self._accelerator = 'cuda'

    def compute_stats(self, sample, rank=None):
        # ... (same as above)
8 changes: 5 additions & 3 deletions docs/DeveloperGuide_ZH.md
@@ -10,7 +10,7 @@

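## Coding Style

We define our coding style in `.pre-commit-config.yaml`. Before contributing code to the repository, please use the `pre-commit` tool to normalize your code
We define our coding style in `.pre-commit-config.yaml`. Before contributing code to the repository, please use the `pre-commit` tool to automatically normalize your code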

```shell
# ===========install pre-commit tool===========
@@ -99,20 +99,22 @@ class StatsKeys(object):
return False
```

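- If Hugging Face models are used within an operator, you might want to leverage GPU acceleration. To achieve this, declare `self._accelerator = 'cuda'` in the constructor, and ensure that the `compute_stats` and `process` methods accept an additional positional argument `rank`.
- If Hugging Face models are used within an operator, you might want to leverage GPU acceleration. To achieve this, declare `_accelerator = 'cuda'` as a class attribute of the operator, and ensure that the `compute_stats` and `process` methods accept an additional positional argument `rank`.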

```python
# ... (same as above)

@OPERATORS.register_module('text_length_filter')
class TextLengthFilter(Filter):

    _accelerator = 'cuda'

    def __init__(self,
                 min_len: PositiveInt = 10,
                 max_len: PositiveInt = sys.maxsize,
                 *args,
                 **kwargs):
        # ... (same as above)
        self._accelerator = 'cuda'

    def compute_stats(self, sample, rank=None):
        # ... (same as above)