[ppdiffusers] add AnimateAnyone inference example #435

Merged
54 changes: 54 additions & 0 deletions ppdiffusers/examples/AnimateAnyone/README.md
@@ -0,0 +1,54 @@
# Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation

## 1. Model Introduction

Animate Anyone is a character-animation technique from Alibaba's Institute for Intelligent Computing that turns a static image into a character video driven by a specified motion sequence. It uses a diffusion model to preserve temporal consistency and content detail throughout the image-to-video conversion. This implementation is adapted from [MooreThreads/Moore-AnimateAnyone](https://github.com/MooreThreads/Moore-AnimateAnyone/tree/master).

![](https://github.com/PaddlePaddle/PaddleMIX/assets/46399096/595032c0-6f76-49ba-834a-3e92e790ea2f)

Note: the figure above is taken from [AnimateAnyone](https://arxiv.org/pdf/2311.17117.pdf).

## 2. Environment Setup

Clone the PaddleMIX repository with `git clone` and install the required dependencies. Make sure your PaddlePaddle version is 2.6.0 or later; for framework installation, see the [PaddlePaddle installation guide](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html).

```bash
# 克隆 PaddleMIX 仓库
git clone https://github.com/PaddlePaddle/PaddleMIX

# Install paddlepaddle-gpu 2.6.0; the CUDA 12.0 build is used here.
# Visit https://www.paddlepaddle.org.cn/ to find the build that matches your environment.
python -m pip install paddlepaddle-gpu==2.6.0.post120 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html

# Enter the AnimateAnyone example directory
cd PaddleMIX/ppdiffusers/examples/AnimateAnyone/

# Install the latest ppdiffusers wheel
pip install https://paddlenlp.bj.bcebos.com/models/community/junnyu/wheels/ppdiffusers-0.24.0-py3-none-any.whl --user

# Install the remaining dependencies; if you hit a permissions error, append --user
pip install -r requirements.txt
```

## 3. Model Download

Run the automatic download script below to fetch the AnimateAnyone model weights from [Huggingface](https://huggingface.co/patrolli/AnimateAnyone); the weight files are stored under `./pretrained_weights`.

```shell
python scripts/download_weights.py
```
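The script skips weight files that already exist on disk, so it is safe to re-run after an interrupted download. A minimal sketch of that resume-safe pattern — the `fetch` callable here is a hypothetical stand-in for `hf_hub_download`:

```python
from pathlib import Path


def download_missing(local_dir, filenames, fetch):
    """Call fetch(name, dest_dir) only for files not already present on disk."""
    dest = Path(local_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for name in filenames:
        if (dest / name).exists():
            continue  # already downloaded; skip so re-runs are cheap
        fetch(name, dest)
```

`scripts/download_weights.py` applies the same existence check before each `hf_hub_download` call.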

## 4. Model Inference

Run the inference command below to generate an animation with the specified width, height, and frame count; the result is saved under `./output`.

```shell
python -m scripts.pose2vid --config ./configs/inference/animation.yaml -W 512 -H 784 -L 120
```
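The `-W`/`-H`/`-L` flags set the output width, height, and frame count. A sketch of the CLI mirroring the options defined in `scripts/pose2vid.py` (defaults taken from that script):

```python
import argparse


def build_parser():
    # mirrors the CLI of scripts/pose2vid.py
    parser = argparse.ArgumentParser()
    parser.add_argument("--config")
    parser.add_argument("-W", type=int, default=512)   # output width
    parser.add_argument("-H", type=int, default=784)   # output height
    parser.add_argument("-L", type=int, default=24)    # number of frames
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--cfg", type=float, default=3.5)  # guidance scale
    parser.add_argument("--steps", type=int, default=1)
    parser.add_argument("--fps", type=int, default=30)
    return parser
```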

The generated result is shown below:
<video controls autoplay loop src="https://github.com/PaddlePaddle/PaddleMIX/assets/46399096/4343b522-4449-4db2-be28-fdbbe04f90d4" muted="false"></video>

## 5. References

- [MooreThreads/Moore-AnimateAnyone](https://github.com/MooreThreads/Moore-AnimateAnyone/tree/master)
- [AnimateAnyone](https://github.com/HumanAIGC/AnimateAnyone)
@@ -0,0 +1,15 @@
pretrained_base_model_path: "./pretrained_weights/stable-diffusion-v1-5/"
pretrained_vae_path: "stabilityai/sd-vae-ft-mse"
image_encoder_path: "lambdalabs/sd-image-variations-diffusers/image_encoder"


denoising_unet_path: "./pretrained_weights/denoising_unet.pth"
reference_unet_path: "./pretrained_weights/reference_unet.pth"
pose_guider_path: "./pretrained_weights/pose_guider.pth"
motion_module_path: "./pretrained_weights/motion_module.pth"

inference_config: "./configs/inference/inference_v2.yaml"

test_cases:
  "./configs/inference/ref_images/anyone-10.png":
    - "./configs/inference/pose_videos/anyone-video-1_kps.mp4"
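`test_cases` maps each reference image to a list of driving pose videos; each (image, video) pair produces one output clip. A sketch of how such a mapping expands into jobs, using illustrative paths:

```python
def expand_test_cases(test_cases):
    """Flatten {ref_image: [pose_video, ...]} into (ref_image, pose_video) jobs."""
    jobs = []
    for ref_image, pose_videos in test_cases.items():
        for pose_video in pose_videos:
            jobs.append((ref_image, pose_video))
    return jobs
```

`scripts/pose2vid.py` performs the same double loop over `config["test_cases"]`.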
@@ -0,0 +1,35 @@
unet_additional_kwargs:
  use_inflated_groupnorm: true
  unet_use_cross_frame_attention: false
  unet_use_temporal_attention: false
  use_motion_module: true
  motion_module_resolutions:
    - 1
    - 2
    - 4
    - 8
  motion_module_mid_block: true
  motion_module_decoder_only: false
  motion_module_type: Vanilla
  motion_module_kwargs:
    num_attention_heads: 8
    num_transformer_block: 1
    attention_block_types:
      - Temporal_Self
      - Temporal_Self
    temporal_position_encoding: true
    temporal_position_encoding_max_len: 32
    temporal_attention_dim_div: 1

noise_scheduler_kwargs:
  beta_start: 0.00085
  beta_end: 0.012
  beta_schedule: "linear"
  clip_sample: false
  steps_offset: 1
  ### Zero-SNR params
  prediction_type: "v_prediction"
  rescale_betas_zero_snr: True
  timestep_spacing: "trailing"

sampler: DDIM
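`timestep_spacing: "trailing"` together with `rescale_betas_zero_snr` follows the zero-terminal-SNR recipe: inference timesteps are counted back from the last training timestep, so sampling starts at the noisiest step. A pure-Python sketch of the trailing schedule, assuming 1000 training timesteps (this mirrors the diffusers-style scheduler behavior, not the ppdiffusers source itself):

```python
def trailing_timesteps(num_inference_steps, num_train_timesteps=1000):
    """Trailing spacing: step backwards from the final training timestep."""
    step = num_train_timesteps / num_inference_steps
    # e.g. 25 steps over 1000 training steps -> 999, 959, ..., 39
    return [round(num_train_timesteps - i * step) - 1 for i in range(num_inference_steps)]
```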
Binary file not shown.
15 changes: 15 additions & 0 deletions ppdiffusers/examples/AnimateAnyone/requirements.txt
@@ -0,0 +1,15 @@
av==11.0.0
einops==0.7.0
imageio==2.33.0
imageio-ffmpeg==0.4.9
numpy==1.23.5
omegaconf==2.2.3
opencv-contrib-python==4.8.1.78
opencv-python==4.8.1.78
Pillow==9.5.0
scikit-image==0.21.0
scikit-learn==1.3.2
scipy==1.11.4
tqdm==4.66.1
mlflow==2.9.2
hf-transfer
66 changes: 66 additions & 0 deletions ppdiffusers/examples/AnimateAnyone/scripts/download_weights.py
@@ -0,0 +1,66 @@
# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os

os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from pathlib import Path, PurePosixPath

from huggingface_hub import hf_hub_download


def prepare_base_model():
    print("Preparing base stable-diffusion-v1-5 weights...")
    local_dir = "./pretrained_weights/stable-diffusion-v1-5"
    os.makedirs(local_dir, exist_ok=True)
    for hub_file in ["unet/config.json", "unet/diffusion_pytorch_model.bin"]:
        path = Path(hub_file)
        # local_dir is a str, so wrap it in Path before joining
        saved_path = Path(local_dir) / path
        if os.path.exists(saved_path):
            continue
        hf_hub_download(
            repo_id="runwayml/stable-diffusion-v1-5",
            subfolder=str(PurePosixPath(path.parent)),
            filename=path.name,
            local_dir=local_dir,
        )


def prepare_anyone():
    print("Preparing AnimateAnyone weights...")
    local_dir = "./pretrained_weights"
    os.makedirs(local_dir, exist_ok=True)
    for hub_file in [
        "denoising_unet.pth",
        "motion_module.pth",
        "pose_guider.pth",
        "reference_unet.pth",
    ]:
        path = Path(hub_file)
        # local_dir is a str, so wrap it in Path before joining
        saved_path = Path(local_dir) / path
        if os.path.exists(saved_path):
            continue

        # these files live at the repo root, so no subfolder argument is needed
        hf_hub_download(
            repo_id="patrolli/AnimateAnyone",
            filename=path.name,
            local_dir=local_dir,
        )


if __name__ == "__main__":
    prepare_base_model()
    prepare_anyone()
157 changes: 157 additions & 0 deletions ppdiffusers/examples/AnimateAnyone/scripts/pose2vid.py
@@ -0,0 +1,157 @@
# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
from datetime import datetime
from pathlib import Path

import paddle
from einops import repeat
from omegaconf import OmegaConf
from paddle.vision import transforms
from paddlenlp.transformers import CLIPVisionModelWithProjection
from PIL import Image
from src.models.pose_guider import PoseGuider
from src.models.unet_2d_condition import UNet2DConditionModel
from src.models.unet_3d import UNet3DConditionModel
from src.pipelines.pipeline_pose2vid_long import Pose2VideoPipeline
from src.utils.util import get_fps, read_frames, save_video_as_mp4

from ppdiffusers import AutoencoderKL, DDIMScheduler


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--config")
    parser.add_argument("-W", type=int, default=512, help="output video width")
    parser.add_argument("-H", type=int, default=784, help="output video height")
    parser.add_argument("-L", type=int, default=24, help="number of frames to generate")
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--cfg", type=float, default=3.5, help="classifier-free guidance scale")
    parser.add_argument("--steps", type=int, default=1, help="number of denoising steps")
    parser.add_argument("--fps", type=int, default=30)
    args = parser.parse_args()

    return args


def main():
    args = parse_args()

    config = OmegaConf.load(args.config)

    vae = AutoencoderKL.from_pretrained(
        config.pretrained_vae_path,
        from_diffusers=True,
        from_hf_hub=True,
    )

    reference_unet = UNet2DConditionModel.from_pretrained(
        config.pretrained_base_model_path,
        subfolder="unet",
        from_diffusers=True,
        from_hf_hub=True,
    )

    inference_config_path = config.inference_config
    infer_config = OmegaConf.load(inference_config_path)
    denoising_unet = UNet3DConditionModel.from_pretrained_2d(
        config.pretrained_base_model_path,
        config.motion_module_path,
        subfolder="unet",
        unet_additional_kwargs=infer_config.unet_additional_kwargs,
    )

    pose_guider = PoseGuider(320, block_out_channels=(16, 32, 96, 256))

    image_enc = CLIPVisionModelWithProjection.from_pretrained(
        config.image_encoder_path,
    )

    sched_kwargs = OmegaConf.to_container(infer_config.noise_scheduler_kwargs)
    scheduler = DDIMScheduler(**sched_kwargs)

    generator = paddle.Generator().manual_seed(args.seed)

    width, height = args.W, args.H

    pipe = Pose2VideoPipeline(
        vae=vae,
        image_encoder=image_enc,
        reference_unet=reference_unet,
        denoising_unet=denoising_unet,
        pose_guider=pose_guider,
        scheduler=scheduler,
    )

    pipe.load_pretrained(config)

    date_str = datetime.now().strftime("%Y%m%d")
    time_str = datetime.now().strftime("%H%M")
    save_dir_name = f"{time_str}--seed_{args.seed}-{args.W}x{args.H}"

    save_dir = Path(f"output/{date_str}/{save_dir_name}")
    save_dir.mkdir(exist_ok=True, parents=True)

    for ref_image_path in config["test_cases"].keys():
        # Each ref_image may correspond to multiple pose videos
        for pose_video_path in config["test_cases"][ref_image_path]:
            ref_name = Path(ref_image_path).stem
            pose_name = Path(pose_video_path).stem.replace("_kps", "")

            ref_image_pil = Image.open(ref_image_path).convert("RGB")

            pose_list = []
            pose_tensor_list = []
            pose_images = read_frames(pose_video_path)
            src_fps = get_fps(pose_video_path)
            print(f"pose video has {len(pose_images)} frames, with {src_fps} fps")
            pose_transform = transforms.Compose([transforms.Resize((height, width)), transforms.ToTensor()])
            for pose_image_pil in pose_images[: args.L]:
                pose_tensor_list.append(pose_transform(pose_image_pil))
                pose_list.append(pose_image_pil)

            ref_image_tensor = pose_transform(ref_image_pil)  # (c, h, w)
            ref_image_tensor = ref_image_tensor.unsqueeze(1).unsqueeze(0)  # (1, c, 1, h, w)
            ref_image_tensor = repeat(ref_image_tensor, "b c f h w -> b c (repeat f) h w", repeat=args.L)

            # stack per-frame tensors into (f, c, h, w), swap the frame and channel
            # axes to (c, f, h, w), then add a batch axis -> (1, c, f, h, w)
            pose_tensor = paddle.stack(x=pose_tensor_list, axis=0)
            pose_tensor = pose_tensor.transpose(perm=[1, 0, 2, 3])
            pose_tensor = pose_tensor.unsqueeze(axis=0)

            video = pipe(
                ref_image_pil,
                pose_list,
                width,
                height,
                args.L,
                args.steps,
                args.cfg,
                generator=generator,
            ).videos

            save_video_as_mp4(
                video,
                f"{save_dir}/{ref_name}_{pose_name}_{args.H}x{args.W}_{int(args.cfg)}_{time_str}.mp4",
                fps=src_fps if args.fps is None else args.fps,
            )


if __name__ == "__main__":
    main()
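The pose frames go through a fixed layout sequence before entering the pipeline: per-frame (c, h, w) tensors are stacked to (f, c, h, w), transposed to (c, f, h, w), and batched to (1, c, f, h, w). A shape-only sketch of that bookkeeping (pure Python, no framework required):

```python
def pipeline_pose_shape(num_frames, channels, height, width):
    """Trace the tensor shapes produced by the stack/transpose/unsqueeze steps."""
    stacked = (num_frames, channels, height, width)                 # paddle.stack over frames
    transposed = (stacked[1], stacked[0], stacked[2], stacked[3])   # perm [1, 0, 2, 3]
    return (1,) + transposed                                        # unsqueeze(axis=0)
```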
13 changes: 13 additions & 0 deletions ppdiffusers/examples/AnimateAnyone/src/__init__.py
@@ -0,0 +1,13 @@
# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.