finetune yaml update #143

Merged
merged 5 commits on Aug 20, 2024
27 changes: 14 additions & 13 deletions README.md
@@ -14,26 +14,30 @@
📚 Check here to view <a href="https://arxiv.org/abs/2408.06072" target="_blank">Paper</a>
</p>
<p align="center">
👋 Join our <a href="resources/WECHAT.md" target="_blank">WeChat</a> and <a href="https://discord.gg/Ewaabk6s" target="_blank">Discord</a>
👋 Join our <a href="resources/WECHAT.md" target="_blank">WeChat</a> and <a href="https://discord.gg/B94UfuhN" target="_blank">Discord</a>
</p>
<p align="center">
📍 Visit <a href="https://chatglm.cn/video?fr=osm_cogvideox">清影</a> and <a href="https://open.bigmodel.cn/?utm_campaign=open&_channel_track_key=OWTVNma9">API Platform</a> to experience larger-scale commercial video generation models.
</p>

## Update and News

- 🔥🔥 **News**: ```2024/8/15```: The `SwissArmyTransformer` dependency in CogVideoX has been upgraded to `0.4.12`. Fine-tuning
- 🔥🔥 **News**: ```2024/8/20```: [VEnhancer](https://github.com/Vchitect/VEnhancer) now supports enhancing videos generated by
CogVideoX, achieving higher resolution and higher quality video rendering. We welcome you to try it out by following
the [tutorial](tools/venhancer/README_zh.md).
- 🔥 **News**: ```2024/8/15```: The `SwissArmyTransformer` dependency in CogVideoX has been upgraded to `0.4.12`.
Fine-tuning
no longer requires installing `SwissArmyTransformer` from source. Additionally, the `Tied VAE` technique has been
applied in the implementation within the `diffusers` library. Please install `diffusers` and `accelerate` libraries
from source. Inference for CogVideoX now requires only 12GB of VRAM.
from source. Inference for CogVideoX now requires only 12GB of VRAM. The inference code needs to be modified. Please
check [cli_demo](inference/cli_demo.py).
- 🔥 **News**: ```2024/8/12```: The CogVideoX paper has been uploaded to arxiv. Feel free to check out
the [paper](https://arxiv.org/abs/2408.06072).
- 🔥 **News**: ```2024/8/7```: CogVideoX has been integrated into `diffusers` version 0.30.0. Inference can now be
performed
on a single 3090 GPU. For more details, please refer to the [code](inference/cli_demo.py).
- 🔥 **News**: ```2024/8/6```: We have also open-sourced **3D Causal VAE** used in **CogVideoX-2B**, which can
reconstruct
the video almost losslessly.
reconstruct the video almost losslessly.
- 🔥 **News**: ```2024/8/6```: We have open-sourced **CogVideoX-2B**, the first model in the CogVideoX series of video
generation models.
- 🌱 **Source**: ```2022/5/19```: We have open-sourced **CogVideo** (now available in the `CogVideo` branch), the **first**
@@ -219,14 +223,11 @@ hands-on practice on text-to-video generation. *The original input is in Chinese
🌟 If you find our work helpful, please leave us a star and cite our paper.

```
@misc{yang2024cogvideoxtexttovideodiffusionmodels,
title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
author={Zhuoyi Yang and Jiayan Teng and Wendi Zheng and Ming Ding and Shiyu Huang and Jiazheng Xu and Yuanming Yang and Wenyi Hong and Xiaohan Zhang and Guanyu Feng and Da Yin and Xiaotao Gu and Yuxuan Zhang and Weihan Wang and Yean Cheng and Ting Liu and Bin Xu and Yuxiao Dong and Jie Tang},
year={2024},
eprint={2408.06072},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2408.06072},
@article{yang2024cogvideox,
title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
author={Yang, Zhuoyi and Teng, Jiayan and Zheng, Wendi and Ding, Ming and Huang, Shiyu and Xu, Jiazheng and Yang, Yuanming and Hong, Wenyi and Zhang, Xiaohan and Feng, Guanyu and others},
journal={arXiv preprint arXiv:2408.06072},
year={2024}
}
@article{hong2022cogvideo,
title={CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers},
28 changes: 17 additions & 11 deletions README_ja.md
@@ -14,15 +14,24 @@
📚 Check out the <a href="https://arxiv.org/abs/2408.06072" target="_blank">Paper</a>
</p>
<p align="center">
👋 Join our <a href="resources/WECHAT.md" target="_blank">WeChat</a> and <a href="https://discord.gg/Ewaabk6s" target="_blank">Discord</a>
👋 Join our <a href="resources/WECHAT.md" target="_blank">WeChat</a> and <a href="https://discord.gg/B94UfuhN" target="_blank">Discord</a>
</p>
<p align="center">
📍 Visit <a href="https://chatglm.cn/video?fr=osm_cogvideox">清影</a> and the <a href="https://open.bigmodel.cn/?utm_campaign=open&_channel_track_key=OWTVNma9">API Platform</a> to experience larger-scale commercial video generation models.
</p>

## Update and News
- 🔥🔥 **News**: 2024/8/15: The `SwissArmyTransformer` dependency in CogVideoX has been upgraded to `0.4.12`, so fine-tuning no longer requires installing `SwissArmyTransformer` from source. The `Tied VAE` technique has also been applied in the `diffusers` implementation; please install the `diffusers` and `accelerate` libraries from source. Inference for CogVideoX now requires only 12GB of VRAM.
- 🔥 **News**: ```2024/8/12```: The CogVideoX paper has been uploaded to arXiv. Feel free to check out the [paper](https://arxiv.org/abs/2408.06072).

- 🔥🔥 **News**: ```2024/8/20```: [VEnhancer](https://github.com/Vchitect/VEnhancer) now supports enhancing videos generated by
CogVideoX, achieving higher resolution and higher quality video rendering. We welcome you to try it out by following
the [tutorial](tools/venhancer/README_ja.md).
- 🔥 **News**: 2024/8/15: The `SwissArmyTransformer` dependency in CogVideoX has been upgraded to `0.4.12`, so fine-tuning
no longer requires installing `SwissArmyTransformer` from source. The `Tied VAE` technique has also been applied in the
`diffusers` implementation; please install the `diffusers` and `accelerate` libraries from source. Inference for CogVideoX
now requires only 12GB of VRAM. The inference code needs to be modified; please check [cli_demo](inference/cli_demo.py).
- 🔥 **News**: ```2024/8/12```: The CogVideoX paper has been uploaded to arXiv. Feel free to check out
the [paper](https://arxiv.org/abs/2408.06072).
- 🔥 **News**: ```2024/8/7```: CogVideoX has been integrated into `diffusers` version 0.30.0. Inference can now be
performed on a single 3090 GPU. For more details, please refer to the [code](inference/cli_demo.py).
- 🔥 **News**: ```2024/8/6```: We have also open-sourced the **3D Causal VAE** used in **CogVideoX-2B**, which can reconstruct videos almost losslessly.
@@ -211,14 +220,11 @@ The CogVideo demo is at [https://models.aminer.cn/cogvideo](https://models.aminer.c
🌟 If you find our work helpful, please leave us a star and cite our paper.

```
@misc{yang2024cogvideoxtexttovideodiffusionmodels,
title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
author={Zhuoyi Yang and Jiayan Teng and Wendi Zheng and Ming Ding and Shiyu Huang and Jiazheng Xu and Yuanming Yang and Wenyi Hong and Xiaohan Zhang and Guanyu Feng and Da Yin and Xiaotao Gu and Yuxuan Zhang and Weihan Wang and Yean Cheng and Ting Liu and Bin Xu and Yuxiao Dong and Jie Tang},
year={2024},
eprint={2408.06072},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2408.06072},
@article{yang2024cogvideox,
title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
author={Yang, Zhuoyi and Teng, Jiayan and Zheng, Wendi and Ding, Ming and Huang, Shiyu and Xu, Jiazheng and Yang, Yuanming and Hong, Wenyi and Zhang, Xiaohan and Feng, Guanyu and others},
journal={arXiv preprint arXiv:2408.06072},
year={2024}
}
@article{hong2022cogvideo,
title={CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers},
22 changes: 11 additions & 11 deletions README_zh.md
@@ -15,17 +15,20 @@
📚 Check out the <a href="https://arxiv.org/abs/2408.06072" target="_blank">Paper</a>
</p>
<p align="center">
👋 Join our <a href="resources/WECHAT.md" target="_blank">WeChat</a> and <a href="https://discord.gg/Ewaabk6s" target="_blank">Discord</a>
👋 Join our <a href="resources/WECHAT.md" target="_blank">WeChat</a> and <a href="https://discord.gg/B94UfuhN" target="_blank">Discord</a>
</p>
<p align="center">
📍 Visit <a href="https://chatglm.cn/video?fr=osm_cogvideox">清影</a> and the <a href="https://open.bigmodel.cn/?utm_campaign=open&_channel_track_key=OWTVNma9">API Platform</a> to experience larger-scale commercial video generation models.
</p>

## Project Updates

- 🔥🔥 **News**: ```2024/8/15```: The `SwissArmyTransformer` dependency in CogVideoX has been upgraded to `0.4.12`,
- 🔥🔥 **News**: ```2024/8/20```: [VEnhancer](https://github.com/Vchitect/VEnhancer) now supports enhancing videos generated by
CogVideoX, achieving higher resolution and higher quality video rendering. You are welcome to try it out by following the [tutorial](tools/venhancer/README_zh.md).
- 🔥 **News**: ```2024/8/15```: The `SwissArmyTransformer` dependency in CogVideoX has been upgraded to `0.4.12`, so
fine-tuning no longer requires installing `SwissArmyTransformer` from source. The `Tied VAE` technique has also been applied in the `diffusers`
implementation; please install the `diffusers` and `accelerate` libraries from source. Inference for CogVideoX requires only 12GB of VRAM.
implementation; please install the `diffusers` and `accelerate` libraries from source. Inference for CogVideoX requires only
12GB of VRAM. The inference code needs to be modified; please check [cli_demo](inference/cli_demo.py)
- 🔥 **News**: ```2024/8/12```: The CogVideoX paper has been uploaded to arXiv; feel free to check out the [paper](https://arxiv.org/abs/2408.06072).
- 🔥 **News**: ```2024/8/7```: CogVideoX has been merged into `diffusers` version 0.30.0. Inference can run on a single
3090 GPU; for details, see the [code](inference/cli_demo.py).
@@ -191,14 +194,11 @@ The CogVideo demo site is at [https://models.aminer.cn/cogvideo](https://models.amine
🌟 If you find our work helpful, please cite our paper and leave us your valuable stars.

```
@misc{yang2024cogvideoxtexttovideodiffusionmodels,
title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
author={Zhuoyi Yang and Jiayan Teng and Wendi Zheng and Ming Ding and Shiyu Huang and Jiazheng Xu and Yuanming Yang and Wenyi Hong and Xiaohan Zhang and Guanyu Feng and Da Yin and Xiaotao Gu and Yuxuan Zhang and Weihan Wang and Yean Cheng and Ting Liu and Bin Xu and Yuxiao Dong and Jie Tang},
year={2024},
eprint={2408.06072},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2408.06072},
@article{yang2024cogvideox,
title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
author={Yang, Zhuoyi and Teng, Jiayan and Zheng, Wendi and Ding, Ming and Huang, Shiyu and Xu, Jiazheng and Yang, Yuanming and Hong, Wenyi and Zhang, Xiaohan and Feng, Guanyu and others},
journal={arXiv preprint arXiv:2408.06072},
year={2024}
}
@article{hong2022cogvideo,
title={CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers},
4 changes: 2 additions & 2 deletions requirements.txt
@@ -1,6 +1,6 @@
diffusers==0.30.0
git+https://github.com/huggingface/diffusers.git@main#egg=diffusers
transformers==4.44.0
accelerate==0.33.0
git+https://github.com/huggingface/accelerate.git@main#egg=accelerate
sentencepiece==0.2.0 # T5
SwissArmyTransformer==0.4.12 # Inference
torch==2.4.0 # Tested in 2.2 2.3 2.4 and 2.5
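
Since the updated requirements pull `diffusers` and `accelerate` from their Git main branches rather than pinned releases, a minimal sketch of the equivalent manual installation, assuming a fresh pip environment, might look like this:

```shell
# Sketch only: mirrors the updated requirements.txt entries above.
# Pin the released packages, then install diffusers and accelerate from source (main branch).
pip install transformers==4.44.0 sentencepiece==0.2.0 SwissArmyTransformer==0.4.12 torch==2.4.0
pip install "git+https://github.com/huggingface/diffusers.git@main#egg=diffusers"
pip install "git+https://github.com/huggingface/accelerate.git@main#egg=accelerate"
```
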
160 changes: 65 additions & 95 deletions sat/README.md
@@ -4,7 +4,6 @@

[日本語で読む](./README_ja.md)


This folder contains the inference code using [SAT](https://github.com/THUDM/SwissArmyTransformer) weights and the
fine-tuning code for SAT weights.

@@ -69,110 +68,49 @@ loading it into Deepspeed in Finetune.
0 directories, 8 files
```

3. Modify the file `configs/cogvideox_2b_infer.yaml`.

```yaml
load: "{your_CogVideoX-2b-sat_path}/transformer" ## Transformer model path

conditioner_config:
target: sgm.modules.GeneralConditioner
params:
emb_models:
- is_trainable: false
input_key: txt
ucg_rate: 0.1
target: sgm.modules.encoders.modules.FrozenT5Embedder
params:
model_dir: "google/t5-v1_1-xxl" ## T5 model path
max_length: 226

first_stage_config:
target: sgm.models.autoencoder.VideoAutoencoderInferenceWrapper
params:
cp_size: 1
ckpt_path: "{your_CogVideoX-2b-sat_path}/vae/3d-vae.pt" ## VAE model path
```

+ If you use a txt file to store multiple prompts, refer to `configs/test.txt` for the format: one prompt per line. If
you are unsure how to write prompts, you can first use [this code](../inference/convert_demo.py) to call an LLM for
refinement.
+ If using the command line as input, modify

```yaml
input_type: cli
```

so that prompts can be entered from the command line.

If you want to change the output video directory, you can modify:

```yaml
output_dir: outputs/
```

By default, results are saved in the `outputs/` folder.

4. Run the inference code to start inference

```shell
bash inference.sh
```
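
As a concrete illustration of the txt-based prompt input described above, a hypothetical `configs/test.txt` could be prepared in the one-prompt-per-line format and then consumed by the inference script; the prompts below are placeholders, not from the repository:

```shell
# Sketch only: write a multi-prompt file (one prompt per line), then run inference.
cat > configs/test.txt <<'EOF'
A golden retriever surfs a wave at sunset, cinematic lighting.
A timelapse of a city skyline as fog rolls in at dawn.
EOF
bash inference.sh
```
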
Each text file shares the same name as its corresponding video, serving as the label for that video. Videos and labels
should be matched one-to-one. Generally, a single video should not be associated with multiple labels.

## Fine-Tuning the Model
For style fine-tuning, please prepare at least 50 videos and labels with similar styles to ensure proper fitting.

### Preparing the Dataset
### Modifying Configuration Files

The dataset format should be as follows:
We support two fine-tuning methods: `Lora` and full-parameter fine-tuning. Please note that both methods only fine-tune
the `transformer` part and do not modify the `VAE` section. `T5` is used solely as an Encoder. Please modify
the `configs/sft.yaml` (for full-parameter fine-tuning) file as follows:

```
.
├── labels
│   ├── 1.txt
│   ├── 2.txt
│   ├── ...
└── videos
├── 1.mp4
├── 2.mp4
├── ...
```

Each txt file should have the same name as its corresponding video file and contain the labels for that video. Each
video should have a one-to-one correspondence with a label. Typically, a video should not have multiple labels.

For style fine-tuning, please prepare at least 50 videos and labels with similar styles to facilitate fitting.
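
A quick sanity check of this layout can catch mismatched pairs before training. The sketch below is an illustrative helper, not part of the repository, and assumes the `videos/` and `labels/` directories shown above:

```shell
# Sketch only: verify every video has exactly one same-named label file.
for v in videos/*.mp4; do
  base=$(basename "$v" .mp4)
  [ -f "labels/${base}.txt" ] || echo "missing label for ${base}"
done
```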

### Modifying the Configuration File

We support both `Lora` and `full-parameter fine-tuning` methods. Please note that both fine-tuning methods only apply to
the `transformer` part. The `VAE part` is not modified. `T5` is only used as an Encoder.

the `configs/cogvideox_2b_sft.yaml` (for full fine-tuning) as follows.

```yaml
# checkpoint_activations: True ## using gradient checkpointing (both checkpoint_activations in the configuration file need to be set to True)
# checkpoint_activations: True ## Using gradient checkpointing (Both checkpoint_activations in the config file need to be set to True)
model_parallel_size: 1 # Model parallel size
experiment_name: lora-disney # Experiment name (do not change)
mode: finetune # Mode (do not change)
load: "{your_CogVideoX-2b-sat_path}/transformer" # Transformer model path
no_load_rng: True # Whether to load the random seed
train_iters: 1000 # Number of training iterations
eval_iters: 1 # Number of evaluation iterations
eval_interval: 100 # Evaluation interval
eval_batch_size: 1 # Batch size for evaluation
experiment_name: lora-disney # Experiment name (do not modify)
mode: finetune # Mode (do not modify)
load: "{your_CogVideoX-2b-sat_path}/transformer" ## Transformer model path
no_load_rng: True # Whether to load random seed
train_iters: 1000 # Training iterations
eval_iters: 1 # Evaluation iterations
eval_interval: 100 # Evaluation interval
eval_batch_size: 1 # Evaluation batch size
save: ckpts # Model save path
save_interval: 100 # Model save interval
log_interval: 20 # Log output interval
train_data: [ "your train data path" ]
valid_data: [ "your val data path" ] # Training and validation sets can be the same
split: 1,0,0 # Ratio of training, validation, and test sets
num_workers: 8 # Number of worker threads for data loading
force_train: True # Allow missing keys when loading ckpt (refer to T5 and VAE which are loaded independently)
only_log_video_latents: True # Avoid using VAE decoder when eval to save memory
valid_data: [ "your val data path" ] # Training and validation datasets can be the same
split: 1,0,0 # Training, validation, and test set ratio
num_workers: 8 # Number of worker threads for data loader
force_train: True # Allow missing keys when loading checkpoint (T5 and VAE are loaded separately)
only_log_video_latents: True # Avoid memory overhead caused by VAE decode
deepspeed:
bf16:
enabled: False # For CogVideoX-2B set to False and for CogVideoX-5B set to True
fp16:
enabled: True # For CogVideoX-2B set to True and for CogVideoX-5B set to False
```

If you wish to use Lora fine-tuning, you also need to modify:
If you wish to use Lora fine-tuning, you also need to modify the `cogvideox_<model_parameters>_lora` file:

Here, take `CogVideoX-2B` as a reference:

```yaml
```
model:
scale_factor: 1.15258426
disable_first_stage_autocast: true
@@ -186,15 +124,47 @@ model:
r: 256
```

### Fine-Tuning and Validation
### Modifying Run Scripts

1. Run the inference code to start fine-tuning.
Edit `finetune_single_gpu.sh` or `finetune_multi_gpus.sh` to select the configuration file. Below are two examples:

```shell
1. If you want to use the `CogVideoX-2B` model and the `Lora` method, you need to modify `finetune_single_gpu.sh`
or `finetune_multi_gpus.sh`:

```
run_cmd="torchrun --standalone --nproc_per_node=8 train_video.py --base configs/cogvideox_2b_lora.yaml configs/sft.yaml --seed $RANDOM"
```

2. If you want to use the `CogVideoX-2B` model and the `full-parameter fine-tuning` method, you need to
modify `finetune_single_gpu.sh` or `finetune_multi_gpus.sh`:

```
run_cmd="torchrun --standalone --nproc_per_node=8 train_video.py --base configs/cogvideox_2b.yaml configs/sft.yaml --seed $RANDOM"
```
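
A note on the examples above: both commands assume 8 GPUs via `--nproc_per_node=8`. For a single-GPU run, as `finetune_single_gpu.sh` presumably targets, the same command would drop to one process; this is an assumed variant, not a line from the repository:

```shell
# Sketch only: single-GPU variant of the Lora run command shown above (assumed, not from the repository).
run_cmd="torchrun --standalone --nproc_per_node=1 train_video.py --base configs/cogvideox_2b_lora.yaml configs/sft.yaml --seed $RANDOM"
```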

### Fine-Tuning and Evaluation

Run the following script to start fine-tuning.

```
bash finetune_single_gpu.sh # Single GPU
bash finetune_multi_gpus.sh # Multi GPUs
```

### Using the Fine-Tuned Model

The fine-tuned model cannot be merged; instead, modify the inference script `inference.sh` as follows:

```
run_cmd="$environs python sample_video.py --base configs/cogvideox_<model_parameters>_lora.yaml configs/inference.yaml --seed 42"
```

Then, execute the code:

```
bash inference.sh
```
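
For orientation, a hypothetical end-to-end Lora session combining the steps above might look like the sketch below; it assumes the checkpoint directory from `save: ckpts` in the configuration and the Lora inference config referenced in `inference.sh`:

```shell
# Sketch only: end-to-end Lora fine-tune-then-sample loop under the assumptions above.
bash finetune_single_gpu.sh   # fine-tune; checkpoints are written under ckpts/ (save: ckpts)
bash inference.sh             # sample with the cogvideox_<model_parameters>_lora.yaml config referenced above
```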

### Converting to Huggingface Diffusers Supported Weights

The SAT weight format is different from Huggingface's weight format and needs to be converted. Please run: