finetune yaml update #143

Merged
merged 5 commits on Aug 20, 2024
27 changes: 14 additions & 13 deletions README.md
@@ -14,26 +14,30 @@
📚 Check here to view <a href="https://arxiv.org/abs/2408.06072" target="_blank">Paper</a>
</p>
<p align="center">
👋 Join our <a href="resources/WECHAT.md" target="_blank">WeChat</a> and <a href="https://discord.gg/Ewaabk6s" target="_blank">Discord</a>
👋 Join our <a href="resources/WECHAT.md" target="_blank">WeChat</a> and <a href="https://discord.gg/B94UfuhN" target="_blank">Discord</a>
</p>
<p align="center">
📍 Visit <a href="https://chatglm.cn/video?fr=osm_cogvideox">清影</a> and <a href="https://open.bigmodel.cn/?utm_campaign=open&_channel_track_key=OWTVNma9">API Platform</a> to experience larger-scale commercial video generation models.
</p>

## Update and News

- 🔥🔥 **News**: ```2024/8/15```: The `SwissArmyTransformer` dependency in CogVideoX has been upgraded to `0.4.12`. Fine-tuning
- 🔥🔥 **News**: ```2024/8/20```: [VEnhancer](https://github.com/Vchitect/VEnhancer) now supports enhancing videos generated by
CogVideoX, achieving higher resolution and higher quality video rendering. We welcome you to try it out by following
the [tutorial](tools/venhancer/README_zh.md).
- 🔥 **News**: ```2024/8/15```: The `SwissArmyTransformer` dependency in CogVideoX has been upgraded to `0.4.12`.
Fine-tuning
no longer requires installing `SwissArmyTransformer` from source. Additionally, the `Tied VAE` technique has been
applied in the implementation within the `diffusers` library. Please install `diffusers` and `accelerate` libraries
from source. Inference for CogVideoX now requires only 12GB of VRAM.
from source. Inference for CogVideoX now requires only 12GB of VRAM. The inference code needs to be modified. Please
check [cli_demo](inference/cli_demo.py).
- 🔥 **News**: ```2024/8/12```: The CogVideoX paper has been uploaded to arxiv. Feel free to check out
the [paper](https://arxiv.org/abs/2408.06072).
- 🔥 **News**: ```2024/8/7```: CogVideoX has been integrated into `diffusers` version 0.30.0. Inference can now be
performed
on a single 3090 GPU. For more details, please refer to the [code](inference/cli_demo.py).
- 🔥 **News**: ```2024/8/6```: We have also open-sourced **3D Causal VAE** used in **CogVideoX-2B**, which can
reconstruct
the video almost losslessly.
reconstruct the video almost losslessly.
- 🔥 **News**: ```2024/8/6```: We have open-sourced **CogVideoX-2B**, the first model in the CogVideoX series of video
generation models.
- 🌱 **Source**: ```2022/5/19```: We have open-sourced **CogVideo** (now available in the `CogVideo` branch), the **first**
@@ -219,14 +223,11 @@ hands-on practice on text-to-video generation. *The original input is in Chinese
🌟 If you find our work helpful, please leave us a star and cite our paper.

```
@misc{yang2024cogvideoxtexttovideodiffusionmodels,
title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
author={Zhuoyi Yang and Jiayan Teng and Wendi Zheng and Ming Ding and Shiyu Huang and Jiazheng Xu and Yuanming Yang and Wenyi Hong and Xiaohan Zhang and Guanyu Feng and Da Yin and Xiaotao Gu and Yuxuan Zhang and Weihan Wang and Yean Cheng and Ting Liu and Bin Xu and Yuxiao Dong and Jie Tang},
year={2024},
eprint={2408.06072},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2408.06072},
@article{yang2024cogvideox,
title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
author={Yang, Zhuoyi and Teng, Jiayan and Zheng, Wendi and Ding, Ming and Huang, Shiyu and Xu, Jiazheng and Yang, Yuanming and Hong, Wenyi and Zhang, Xiaohan and Feng, Guanyu and others},
journal={arXiv preprint arXiv:2408.06072},
year={2024}
}
@article{hong2022cogvideo,
title={CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers},
28 changes: 17 additions & 11 deletions README_ja.md
@@ -14,15 +14,24 @@
📚 Check out the <a href="https://arxiv.org/abs/2408.06072" target="_blank">Paper</a>
</p>
<p align="center">
👋 Join our <a href="resources/WECHAT.md" target="_blank">WeChat</a> and <a href="https://discord.gg/Ewaabk6s" target="_blank">Discord</a>
👋 Join our <a href="resources/WECHAT.md" target="_blank">WeChat</a> and <a href="https://discord.gg/B94UfuhN" target="_blank">Discord</a>
</p>
<p align="center">
📍 Visit <a href="https://chatglm.cn/video?fr=osm_cogvideox">清影</a> and the <a href="https://open.bigmodel.cn/?utm_campaign=open&_channel_track_key=OWTVNma9">API Platform</a> to experience larger-scale commercial video generation models.
</p>

## Update and News
- 🔥🔥 **News**: 2024/8/15: The `SwissArmyTransformer` dependency in CogVideoX has been upgraded to `0.4.12`, so fine-tuning no longer requires installing `SwissArmyTransformer` from source. The `Tied VAE` technique has also been applied in the `diffusers` implementation; please install the `diffusers` and `accelerate` libraries from source. Inference for CogVideoX now requires only 12GB of VRAM.
- 🔥 **News**: ```2024/8/12```: The CogVideoX paper has been uploaded to arXiv. Feel free to check out the [paper](https://arxiv.org/abs/2408.06072).

- 🔥🔥 **News**: ```2024/8/20```: [VEnhancer](https://github.com/Vchitect/VEnhancer) now supports enhancing videos generated by
CogVideoX, achieving higher resolution and higher quality video rendering. We welcome you to try it out by following
the [tutorial](tools/venhancer/README_ja.md).
- 🔥 **News**: 2024/8/15: The `SwissArmyTransformer` dependency in CogVideoX has been upgraded to `0.4.12`, so fine-tuning
no longer requires installing `SwissArmyTransformer` from source. The `Tied VAE` technique has also been applied in the
`diffusers` implementation; please install the `diffusers` and `accelerate` libraries from source. Inference for CogVideoX
now requires only 12GB of VRAM. The inference code needs to be modified; please check [cli_demo](inference/cli_demo.py).
- 🔥 **News**: ```2024/8/12```: The CogVideoX paper has been uploaded to arXiv. Feel free to check out
the [paper](https://arxiv.org/abs/2408.06072).
- 🔥 **News**: ```2024/8/7```: CogVideoX has been integrated into `diffusers` version 0.30.0. Inference can now be
performed on a single 3090 GPU. For more details, please refer to the [code](inference/cli_demo.py).
- 🔥 **News**: ```2024/8/6```: We have also open-sourced the **3D Causal VAE** used in **CogVideoX-2B**, which can reconstruct videos almost losslessly.
@@ -211,14 +220,11 @@ The CogVideo demo is at [https://models.aminer.cn/cogvideo](https://models.aminer.c
🌟 If you find our work helpful, please leave us a star and cite our paper.

```
@misc{yang2024cogvideoxtexttovideodiffusionmodels,
title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
author={Zhuoyi Yang and Jiayan Teng and Wendi Zheng and Ming Ding and Shiyu Huang and Jiazheng Xu and Yuanming Yang and Wenyi Hong and Xiaohan Zhang and Guanyu Feng and Da Yin and Xiaotao Gu and Yuxuan Zhang and Weihan Wang and Yean Cheng and Ting Liu and Bin Xu and Yuxiao Dong and Jie Tang},
year={2024},
eprint={2408.06072},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2408.06072},
@article{yang2024cogvideox,
title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
author={Yang, Zhuoyi and Teng, Jiayan and Zheng, Wendi and Ding, Ming and Huang, Shiyu and Xu, Jiazheng and Yang, Yuanming and Hong, Wenyi and Zhang, Xiaohan and Feng, Guanyu and others},
journal={arXiv preprint arXiv:2408.06072},
year={2024}
}
@article{hong2022cogvideo,
title={CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers},
22 changes: 11 additions & 11 deletions README_zh.md
@@ -15,17 +15,20 @@
📚 Check out the <a href="https://arxiv.org/abs/2408.06072" target="_blank">Paper</a>
</p>
<p align="center">
👋 Join our <a href="resources/WECHAT.md" target="_blank">WeChat</a> and <a href="https://discord.gg/Ewaabk6s" target="_blank">Discord</a>
👋 Join our <a href="resources/WECHAT.md" target="_blank">WeChat</a> and <a href="https://discord.gg/B94UfuhN" target="_blank">Discord</a>
</p>
<p align="center">
📍 Visit <a href="https://chatglm.cn/video?fr=osm_cogvideox">清影</a> and the <a href="https://open.bigmodel.cn/?utm_campaign=open&_channel_track_key=OWTVNma9">API Platform</a> to experience larger-scale commercial video generation models.
</p>

## Project Updates

- 🔥🔥 **News**: ```2024/8/15```: The `SwissArmyTransformer` dependency in CogVideoX has been upgraded to `0.4.12`,
- 🔥🔥 **News**: ```2024/8/20```: [VEnhancer](https://github.com/Vchitect/VEnhancer) now supports enhancing videos generated by
CogVideoX, achieving higher resolution and higher quality video rendering. You are welcome to try it out by following the [tutorial](tools/venhancer/README_zh.md).
- 🔥 **News**: ```2024/8/15```: The `SwissArmyTransformer` dependency in CogVideoX has been upgraded to `0.4.12`, so
fine-tuning no longer requires installing `SwissArmyTransformer` from source. The `Tied VAE` technique has also been applied in the `diffusers`
implementation; please install the `diffusers` and `accelerate` libraries from source. Inference for CogVideoX requires only 12GB of VRAM.
implementation; please install the `diffusers` and `accelerate` libraries from source. Inference for CogVideoX requires only
12GB of VRAM. The inference code needs to be modified; please check [cli_demo](inference/cli_demo.py)
- 🔥 **News**: ```2024/8/12```: The CogVideoX paper has been uploaded to arXiv; feel free to check out the [paper](https://arxiv.org/abs/2408.06072).
- 🔥 **News**: ```2024/8/7```: CogVideoX has been merged into `diffusers` version 0.30.0. Inference can run on a single
3090 GPU; for details, see the [code](inference/cli_demo.py).
@@ -191,14 +194,11 @@ The CogVideo demo site is at [https://models.aminer.cn/cogvideo](https://models.amine
🌟 If you find our work helpful, please cite our paper and leave us your valuable stars.

```
@misc{yang2024cogvideoxtexttovideodiffusionmodels,
title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
author={Zhuoyi Yang and Jiayan Teng and Wendi Zheng and Ming Ding and Shiyu Huang and Jiazheng Xu and Yuanming Yang and Wenyi Hong and Xiaohan Zhang and Guanyu Feng and Da Yin and Xiaotao Gu and Yuxuan Zhang and Weihan Wang and Yean Cheng and Ting Liu and Bin Xu and Yuxiao Dong and Jie Tang},
year={2024},
eprint={2408.06072},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2408.06072},
@article{yang2024cogvideox,
title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
author={Yang, Zhuoyi and Teng, Jiayan and Zheng, Wendi and Ding, Ming and Huang, Shiyu and Xu, Jiazheng and Yang, Yuanming and Hong, Wenyi and Zhang, Xiaohan and Feng, Guanyu and others},
journal={arXiv preprint arXiv:2408.06072},
year={2024}
}
@article{hong2022cogvideo,
title={CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers},
4 changes: 2 additions & 2 deletions requirements.txt
@@ -1,6 +1,6 @@
diffusers==0.30.0
git+https://github.com/huggingface/diffusers.git@main#egg=diffusers
transformers==4.44.0
accelerate==0.33.0
git+https://github.com/huggingface/accelerate.git@main#egg=accelerate
sentencepiece==0.2.0 # T5
SwissArmyTransformer==0.4.12 # Inference
torch==2.4.0 # Tested in 2.2 2.3 2.4 and 2.5
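
Since the updated requirements pull `diffusers` and `accelerate` from their Git main branches rather than pinned releases, a minimal sketch of the equivalent manual installation, assuming a fresh pip environment, might look like this:

```shell
# Sketch only: mirrors the updated requirements.txt entries above.
# Pin the released packages, then install diffusers and accelerate from source (main branch).
pip install transformers==4.44.0 sentencepiece==0.2.0 SwissArmyTransformer==0.4.12 torch==2.4.0
pip install "git+https://github.com/huggingface/diffusers.git@main#egg=diffusers"
pip install "git+https://github.com/huggingface/accelerate.git@main#egg=accelerate"
```
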
160 changes: 65 additions & 95 deletions sat/README.md
@@ -4,7 +4,6 @@

[日本語で読む](./README_ja.md)


This folder contains the inference code using [SAT](https://github.com/THUDM/SwissArmyTransformer) weights and the
fine-tuning code for SAT weights.

@@ -69,110 +68,49 @@ loading it into Deepspeed in Finetune.
0 directories, 8 files
```

3. Modify the file `configs/cogvideox_2b_infer.yaml`.

```yaml
load: "{your_CogVideoX-2b-sat_path}/transformer" ## Transformer model path

conditioner_config:
target: sgm.modules.GeneralConditioner
params:
emb_models:
- is_trainable: false
input_key: txt
ucg_rate: 0.1
target: sgm.modules.encoders.modules.FrozenT5Embedder
params:
model_dir: "google/t5-v1_1-xxl" ## T5 model path
max_length: 226

first_stage_config:
target: sgm.models.autoencoder.VideoAutoencoderInferenceWrapper
params:
cp_size: 1
ckpt_path: "{your_CogVideoX-2b-sat_path}/vae/3d-vae.pt" ## VAE model path
```

+ If you use a txt file to store multiple prompts, refer to `configs/test.txt` for the format: one prompt per line. If
you are unsure how to write prompts, you can first use [this code](../inference/convert_demo.py) to call an LLM for
refinement.
+ If using the command line as input, modify

```yaml
input_type: cli
```

so that prompts can be entered from the command line.

If you want to change the output video directory, you can modify:

```yaml
output_dir: outputs/
```

By default, results are saved in the `outputs/` folder.

4. Run the inference code to start inference

```shell
bash inference.sh
```
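
As a concrete illustration of the txt-based prompt input described above, a hypothetical `configs/test.txt` could be prepared in the one-prompt-per-line format and then consumed by the inference script; the prompts below are placeholders, not from the repository:

```shell
# Sketch only: write a multi-prompt file (one prompt per line), then run inference.
cat > configs/test.txt <<'EOF'
A golden retriever surfs a wave at sunset, cinematic lighting.
A timelapse of a city skyline as fog rolls in at dawn.
EOF
bash inference.sh
```
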
Each text file shares the same name as its corresponding video, serving as the label for that video. Videos and labels
should be matched one-to-one. Generally, a single video should not be associated with multiple labels.

## Fine-Tuning the Model
For style fine-tuning, please prepare at least 50 videos and labels with similar styles to ensure proper fitting.

### Preparing the Dataset
### Modifying Configuration Files

The dataset format should be as follows:
We support two fine-tuning methods: `Lora` and full-parameter fine-tuning. Please note that both methods only fine-tune
the `transformer` part and do not modify the `VAE` section. `T5` is used solely as an Encoder. Please modify
the `configs/sft.yaml` (for full-parameter fine-tuning) file as follows:

```
.
├── labels
│   ├── 1.txt
│   ├── 2.txt
│   ├── ...
└── videos
├── 1.mp4
├── 2.mp4
├── ...
```

Each txt file should have the same name as its corresponding video file and contain the labels for that video. Each
video should have a one-to-one correspondence with a label. Typically, a video should not have multiple labels.

For style fine-tuning, please prepare at least 50 videos and labels with similar styles to facilitate fitting.
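
A quick sanity check of this layout can catch mismatched pairs before training. The sketch below is an illustrative helper, not part of the repository, and assumes the `videos/` and `labels/` directories shown above:

```shell
# Sketch only: verify every video has exactly one same-named label file.
for v in videos/*.mp4; do
  base=$(basename "$v" .mp4)
  [ -f "labels/${base}.txt" ] || echo "missing label for ${base}"
done
```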

### Modifying the Configuration File

We support both `Lora` and `full-parameter fine-tuning` methods. Please note that both fine-tuning methods only apply to
the `transformer` part. The `VAE part` is not modified. `T5` is only used as an Encoder.

the `configs/cogvideox_2b_sft.yaml` (for full fine-tuning) as follows.

```yaml
# checkpoint_activations: True ## using gradient checkpointing (both checkpoint_activations in the configuration file need to be set to True)
# checkpoint_activations: True ## Using gradient checkpointing (Both checkpoint_activations in the config file need to be set to True)
model_parallel_size: 1 # Model parallel size
experiment_name: lora-disney # Experiment name (do not change)
mode: finetune # Mode (do not change)
load: "{your_CogVideoX-2b-sat_path}/transformer" # Transformer model path
no_load_rng: True # Whether to load the random seed
train_iters: 1000 # Number of training iterations
eval_iters: 1 # Number of evaluation iterations
eval_interval: 100 # Evaluation interval
eval_batch_size: 1 # Batch size for evaluation
experiment_name: lora-disney # Experiment name (do not modify)
mode: finetune # Mode (do not modify)
load: "{your_CogVideoX-2b-sat_path}/transformer" ## Transformer model path
no_load_rng: True # Whether to load random seed
train_iters: 1000 # Training iterations
eval_iters: 1 # Evaluation iterations
eval_interval: 100 # Evaluation interval
eval_batch_size: 1 # Evaluation batch size
save: ckpts # Model save path
save_interval: 100 # Model save interval
log_interval: 20 # Log output interval
train_data: [ "your train data path" ]
valid_data: [ "your val data path" ] # Training and validation sets can be the same
split: 1,0,0 # Ratio of training, validation, and test sets
num_workers: 8 # Number of worker threads for data loading
force_train: True # Allow missing keys when loading ckpt (refer to T5 and VAE which are loaded independently)
only_log_video_latents: True # Avoid using VAE decoder when eval to save memory
valid_data: [ "your val data path" ] # Training and validation datasets can be the same
split: 1,0,0 # Training, validation, and test set ratio
num_workers: 8 # Number of worker threads for data loader
force_train: True # Allow missing keys when loading checkpoint (T5 and VAE are loaded separately)
only_log_video_latents: True # Avoid memory overhead caused by VAE decode
deepspeed:
bf16:
enabled: False # For CogVideoX-2B set to False and for CogVideoX-5B set to True
fp16:
enabled: True # For CogVideoX-2B set to True and for CogVideoX-5B set to False
```

If you wish to use Lora fine-tuning, you also need to modify:
If you wish to use Lora fine-tuning, you also need to modify the `cogvideox_<model_parameters>_lora` file:

Here, take `CogVideoX-2B` as a reference:

```yaml
```
model:
scale_factor: 1.15258426
disable_first_stage_autocast: true
@@ -186,15 +124,47 @@ model:
r: 256
```

### Fine-Tuning and Validation
### Modifying Run Scripts

1. Run the inference code to start fine-tuning.
Edit `finetune_single_gpu.sh` or `finetune_multi_gpus.sh` to select the configuration file. Below are two examples:

```shell
1. If you want to use the `CogVideoX-2B` model and the `Lora` method, you need to modify `finetune_single_gpu.sh`
or `finetune_multi_gpus.sh`:

```
run_cmd="torchrun --standalone --nproc_per_node=8 train_video.py --base configs/cogvideox_2b_lora.yaml configs/sft.yaml --seed $RANDOM"
```

2. If you want to use the `CogVideoX-2B` model and the `full-parameter fine-tuning` method, you need to
modify `finetune_single_gpu.sh` or `finetune_multi_gpus.sh`:

```
run_cmd="torchrun --standalone --nproc_per_node=8 train_video.py --base configs/cogvideox_2b.yaml configs/sft.yaml --seed $RANDOM"
```
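
A note on the examples above: both commands assume 8 GPUs via `--nproc_per_node=8`. For a single-GPU run, as `finetune_single_gpu.sh` presumably targets, the same command would drop to one process; this is an assumed variant, not a line from the repository:

```shell
# Sketch only: single-GPU variant of the Lora run command shown above (assumed, not from the repository).
run_cmd="torchrun --standalone --nproc_per_node=1 train_video.py --base configs/cogvideox_2b_lora.yaml configs/sft.yaml --seed $RANDOM"
```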

### Fine-Tuning and Evaluation

Run the following script to start fine-tuning.

```
bash finetune_single_gpu.sh # Single GPU
bash finetune_multi_gpus.sh # Multi GPUs
```

### Using the Fine-Tuned Model

The fine-tuned model cannot be merged; instead, modify the inference script `inference.sh` as follows:

```
run_cmd="$environs python sample_video.py --base configs/cogvideox_<model_parameters>_lora.yaml configs/inference.yaml --seed 42"
```

Then, execute the code:

```
bash inference.sh
```
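
For orientation, a hypothetical end-to-end Lora session combining the steps above might look like the sketch below; it assumes the checkpoint directory from `save: ckpts` in the configuration and the Lora inference config referenced in `inference.sh`:

```shell
# Sketch only: end-to-end Lora fine-tune-then-sample loop under the assumptions above.
bash finetune_single_gpu.sh   # fine-tune; checkpoints are written under ckpts/ (save: ckpts)
bash inference.sh             # sample with the cogvideox_<model_parameters>_lora.yaml config referenced above
```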

### Converting to Huggingface Diffusers Supported Weights

The SAT weight format is different from Huggingface's weight format and needs to be converted. Please run: